AD-A099 482

UNCLASSIFIED

CARNEGIE-MELLON UNIV PITTSBURGH PA DEPT OF ELECTRICAL ENGINEERING
APR 81  X CASTILLO, D P SIEWIOREK  DASG60-80-C-0057


Workload, Performance,
and Reliability
of Digital Computing Systems

Xavier Castillo
Daniel P. Siewiorek

April 1981

FINAL REPORT


Synthesis  of  Fault  Tolerant 
Distributed  Computing  Systems 


Ballistic Missile Defense Systems Command
Contract no. DASG60-80-C-0057

Carnegie-Mellon University
Departments of Electrical Engineering
and Computer Science


The views, opinions, and/or findings contained in this report are those of the authors and should not
be construed as an official Department of the Army position, policy, or decision, unless so designated
by other official documentation.


*  Additional  support  from  the  Fundacion  I.T.P.,  Madrid,  Spain. 




Abstract 



In  this  paper  a  new  modeling  methodology  to  characterize  failure  processes  in  Time-Sharing 
systems  due  to  hardware  transients  and  software  errors  is  summarized.  The  basic  assumption  made 
is  that  the  instantaneous  failure  rate  of  a  system  resource  can  be  approximated  by  a  deterministic 
function  of  time  plus  a  zero-mean  stationary  Gaussian  process,  both  depending  on  the  usage  of  the 
resource  considered.  The  probability  density  function  of  the  time  to  failure  obtained  under  this 
assumption  has  a  decreasing  hazard  function,  partially  explaining  why  other  decreasing  hazard 
function  densities  such  as  the  Weibull  fit  experimental  data  so  well.  Furthermore,  by  considering  the 
Kernel  of  the  Operating  System  as  a  system  resource,  this  methodology  sets  the  basis  for 
independent  methods  of  evaluating  the  contribution  of  software  to  system  unreliability,  and  gives 
some non-obvious hints about how system reliability could be improved. A real system has been
characterized according to this methodology, and an extremely good fit between predicted and
observed behavior has been found. Also, the predicted system behavior according to this methodology
is  compared  with  the  predictions  of  other  models  such  as  the  exponential,  Weibull,  and  periodic 
failure  rate. 
1. Introduction


There are several trends in Distributed Data Processing (DDP) systems that, when applied to the
Ballistic Missile Defense (BMD) task, make reliability and fault-tolerance requirements not only desirable
but also necessary. These trends include:

•  Systems  are  becoming  more  complex.  Thus,  even  though  component  reliability  is 
improving,  the  total  system  reliability  may  be  unacceptably  low. 

• BMD systems must work. Since national survival may depend on the proper functioning of
a BMD system whose capabilities are unused except at the moment of crisis, the BMD
system must be designed to detect and tolerate failures while in the dormant or monitoring
state.

• Repair. System maintenance is often the dominant cost in the system life cycle. This trend
is amplified further by the ever-shrinking cost of hardware. A related problem is the
disparity between increasing system complexity and decreasing repairman skills. Fault
tolerance and Built-In Test (BIT), an application of the fault detection phase of fault
tolerance, can help reduce repair costs and required repair skill levels.

• Transients. Data from several uniprocessor [McConnel 79] and multiprocessor [Siewiorek
78] systems indicate that transient failures are 20 to 60 times more likely than hard
failures. Further, transients exhibit a strong clustering phenomenon. That is, once a
transient has occurred, there is a high probability that another transient will occur within a very
short period of time. This clustering might overwhelm fault-tolerant techniques designed
for hard failure survival (i.e., a second transient might occur before the system can
recover from what it expects is a first hard failure). Transients will become more of a
problem with shrinking device dimensions, where small local electrical fields can cause
devices to change state. Cases where cosmic rays and background radiation in
packaging material caused transients have already been documented. Transient
occurrences also seem related to system load, a particularly disastrous feature for a
BMD system which must respond to threats with peak processing power.


Therefore,  fault-tolerance  cost  and  benefit  measures  are  needed.  To  evaluate  the  impact  of 
unreliability  on  the  performance  of  computing  systems,  a  knowledge  of  the  mechanisms  leading  to 
unreliable  system  behavior  is  required.  From  a  hardware  viewpoint,  transients  are  the  dominant 
cause  of  system  unreliability.  However,  unreliable  software  manifestations  are  almost 
indistinguishable  from  hardware  transients. 

A  methodology  capable  of  modeling  and  characterizing  the  impact  of  hardware  transients  and 
software errors on the performance of DDP systems must be developed. Unfortunately, few
DDP systems are available for general use or experimentation, and of the DDP systems available,
none has the instrumentation tools required to validate a possible theoretical model.
Furthermore, modeling methods for hardware transients and software errors are in a very primitive
state. Hence, it is not yet possible to attack the transient/software error problem for DDP systems directly.
Instead, the problem has to be satisfactorily solved first for simpler, more accessible systems such as
uniprocessor time-sharing systems.

This  report  develops  a  methodology  and  a  model  for  hardware  transients  and  software  errors  on 
time-sharing  systems.  Tools  have  been  developed  to  gather  data  from  several  available  systems. 
Thus,  all  theoretical  results  are  compared  with  the  behavior  of  real  systems.  Some  of  the  results  can 
be extended to DDP systems. In any event, the results presented in this report are a necessary step
towards  the  characterization  of  the  effect  of  transients  and  software  errors  on  DDP  systems. 

1.1  Definitions 

The  following  concepts  need  to  be  precisely  defined : 

Hardware Fault
    Erroneous state of hardware due either to failures of components or to physical
    interference from the environment.

Hardware Error
    Manifestation of a hardware fault within a program or data structure.

Permanent Hardware Fault
    Hardware fault which is continuous and stable, reflecting an irreversible physical
    change in the hardware.

Transient Hardware Fault
    Hardware fault due to temporary environmental conditions.

Software Fault
    Imperfection in the design or implementation of a software module such that upon
    some timing or value conditions in its input data stream it fails to accomplish its
    designed task.

Software Error
    Manifestation of a software fault within a program or data structure.

System Failure
    Manifestation of software or hardware errors that force an entire computing
    system to suspend its operation.

Since  no  repair  takes  place  after  system  failures  due  to  software  faults  or  transient  hardware  faults, 
the time of system failure is essentially equal to the system restart time. Since this report is concerned
solely with modelling hardware transient faults and software faults, the words "system failure" and
"system  restart"  will  be  used  interchangeably  to  describe  the  same  event  in  time. 
1.2 The problem of Characterizing System Reliability


Fault-tolerance  has  traditionally  been  characterized  by  relatively  simple  functions  based  on  strict 
assumptions.  The  Reliability  function  R(t)  is  defined  as  the  probability  of  uninterrupted  operation  up 
to time t given that all hardware was correctly operating at time t = 0. R(t) may be used to characterize
either permanent or transient faults. The usual assumption is that the failure rate is constant
and, for nonredundant systems, the reliability function becomes $e^{-\lambda t}$, where $\lambda$ is the sum of the
failure rates of all the components in the system. A very common quantitative measure is the Mean
Time To Failure (MTTF)

$$\mathrm{MTTF} = \int_0^\infty R(t)\,dt \qquad (1.1)$$


The  popularity  of  the  MTTF  stems  mainly  from  the  fact  that,  for  nonredundant  systems,  it  is  easily 
estimated by dividing the time a system is operational by the number of failures reported. Other
reliability indices used to compare two systems A and B are the Reliability Improvement Factor (RIF)
[Anderson 67]

$$\mathrm{RIF} = \frac{1 - R_A(t)}{1 - R_B(t)} \qquad (1.2)$$

and the Mission Time Improvement Factor (MTIF) [Bouricious 69]

$$\mathrm{MTIF} = \frac{T_A}{T_B} \quad \text{when } R_A(T_A) = R_B(T_B) = R_{\min} \qquad (1.3)$$

which  are  useful  only  when  the  system  under  study  must  be  available  for  a  predetermined  period  of 
time  T  called  "mission  time". 
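
The indices above can be illustrated with a minimal sketch for the constant failure rate case. It assumes R(t) = exp(-lambda t), reads the RIF as the ratio of the unreliabilities of systems A and B at a common time t, and reads the MTIF as the ratio of the mission times at which both systems reach the same reliability level. The function names and the numerical failure rates are hypothetical and chosen only for illustration.

```python
import math

def reliability(t, failure_rate):
    """R(t) = exp(-lambda * t): nonredundant system, constant failure rate (assumed)."""
    return math.exp(-failure_rate * t)

def mttf_numeric(failure_rate, horizon_factor=50.0, steps=100_000):
    """Approximate MTTF = integral from 0 to infinity of R(t) dt (trapezoidal rule).

    The integral is truncated at horizon_factor / failure_rate, where R(t) is negligible.
    """
    horizon = horizon_factor / failure_rate
    dt = horizon / steps
    total = 0.5 * (reliability(0.0, failure_rate) + reliability(horizon, failure_rate))
    for i in range(1, steps):
        total += reliability(i * dt, failure_rate)
    return total * dt

def mttf_estimate(operational_hours, failures_reported):
    """Field estimate: operational time divided by the number of failures reported."""
    return operational_hours / failures_reported

def rif(t, rate_a, rate_b):
    """Ratio of unreliabilities (1 - R_A(t)) / (1 - R_B(t)) at a common time t."""
    return (1.0 - reliability(t, rate_a)) / (1.0 - reliability(t, rate_b))

def mission_time(r_level, failure_rate):
    """Time T at which R(T) drops to r_level."""
    return -math.log(r_level) / failure_rate

def mtif(r_level, rate_a, rate_b):
    """Ratio T_A / T_B of the mission times at which both systems reach r_level."""
    return mission_time(r_level, rate_a) / mission_time(r_level, rate_b)

if __name__ == "__main__":
    # Hypothetical failure rates, in failures per hour.
    print(mttf_numeric(0.1))        # close to 1 / 0.1 = 10 hours
    print(mttf_estimate(1000.0, 100))
    print(rif(1.0, 0.1, 0.2))
    print(mtif(0.9, 0.1, 0.2))
```

Under the exponential assumption the MTIF reduces to the ratio of the two failure rates and does not depend on the reliability level chosen, whereas the RIF does depend on the time t at which it is evaluated.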

The  concept  of  coverage  [Bouricious  69]  is  defined  as  the  conditional  probability  of  successful 
recovery,  given  that  a  fault  has  occurred.  Although  mathematically  attractive,  coverage  has  proven  to 
be very difficult to estimate for real systems. Finally, if the Mean Time To Repair (MTTR) is also
known, an estimate of the system usefulness is given by the Availability

$$A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}} \qquad (1.4)$$


These  and  other  measures  traditionally  used  to  compare  systems  do  not  take  into  account  the 
performance  of  the  system  whose  reliability  is  being  measured.  Consider  Table  1-1  which  lists  the 
results  obtained  from  seven  different  experiments  whose  goal  was  explicitly  to  gain  experience  on 
systems reliability. Data for the first system [Yourdon 72] was obtained from a summary of failure
statistics on a Burroughs 5500 over a 15 month period starting in April of 1969. Limited information
about  the  cause  of  each  failure  is  available.  For  instance,  one  of  the  categories  includes  system 
failures  due  to  unexpected  I/O  intercepts.  These  failures  are  recorded  whenever  the  software 
responds  to  an  interrupt  signifying  that  some  I/O  action  has  taken  place,  but  discovers  that  it  has  no 
record  of  having  initiated  such  action.  It  is  thus  an  indication  of  some  form  of  hardware  or  software 
error  but  the  particular  cause  for  the  failure  (hardware  or  software)  remains  unknown.  The  data  for 
the second system was reported in [Lynch 75] and comes from the first thirteen months of operation
of  an  operating  system  called  Chi/OS  for  the  Univac  1108  developed  by  the  Chi  Corporation  between 
1970  and  1973.  No  explanation  is  given  about  how  such  an  accurate  decomposition  of  failures  due  to 
hardware  and  software  could  be  obtained.  [Reynolds  75]  reports  data  obtained  from  a  dual  IBM 
370/165  at  Hughes  Aircraft  Company  over  a  period  of  three  years  installed  to  handle  a  mixed  batch 
and time sharing load. The fourth system is at the Stanford Linear Accelerator Center (SLAC) where
the  main  workload  is  processed  as  multi-stream  background  batch.  The  system  consists  of  a 
foreground  host  (IBM  370/168)  and  two  background  batch  servers  (IBM  370/168  and  IBM  360/91). 
The architecture is designed to be highly available and reconfigurable. The CMU-10A is an ECL PDP-10
used in the Computer Science Department at Carnegie-Mellon University. The data for the CRAY-1
was reported in [Keller 76], and the data for the three generic UNIVAC systems was reported in
[Siewiorek  80]. 

Table  1-1  gives,  when  available,  a  Mean  Time  to  restart  (MTTS)  value  in  hours  (that  is,  the  Mean 
Time  to  System  Failure),  a  Mean  Number  of  Instructions  to  Restart  (MNIR)  which  is  an  estimate  of  the 
mean  number  of  instructions  executed  from  system  start  up  until  system  failure,  and  the  percentages 
of  system  failures  that  were  caused  by  hardware  faults,  software  faults,  and  whose  cause  could  not  be 
resolved.  The  information  about  execution  rates,  needed  to  compute  the  MNIR  value  was  obtained 
from  [Phister  79]. 
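
As a minimal sketch of how an MNIR figure follows from an MTTS figure, the mean time to restart can be multiplied by an execution rate. The helper name `mnir` is hypothetical; the execution rates used in the example (1.2 MIPS for the CMUA and 138 MIPS for the CRAY-1) are the ones quoted in this section, and treating the rate as constant over the whole interval is a simplifying assumption.

```python
def mnir(mtts_hours, mips):
    """Mean Number of Instructions to Restart.

    mtts_hours: Mean Time to restart, in hours.
    mips: execution rate in millions of instructions per second,
          assumed constant between restarts (a simplification).
    """
    seconds = mtts_hours * 3600.0
    return seconds * mips * 1e6

if __name__ == "__main__":
    cmua = mnir(10, 1.2)     # about 4.3e10 instructions
    cray = mnir(4, 138)      # about 2.0e12 instructions
    print(f"CMUA:   {cmua:.2e}")
    print(f"CRAY-1: {cray:.2e}")
    print(f"ratio:  {cray / cmua:.0f}")
```

The faster machine restarts more often in wall-clock time yet completes far more instructions between restarts, which is why MTTS alone is a poor basis for comparison.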


System                  MTTS (hours)   MNIR          % HW   % SW   % Unknown
B 5500                  14.7           2.6 x 10^10   39.3   8.1    52.6
Chi/OS (Univac 1108)    .17            6.7 x 10^10   45     55     -
dual 370/165            8.86           2.8 x 10^11   65     32     3
SLAC                    20.2           2.3 x 10^11   73.3   21.6   5.1
CMU-10A                 10             4.3 x 10^10   -      -      -
CRAY-1                  4              1.9 x 10^12   -      -      -
UNIVAC (Large)          -              -             51     42     7
UNIVAC (Medium)         -              -             57     41     2
UNIVAC (Small)          -              -             88     9      3

Table 1-1: Reliability experience of several commercial systems.
MTTS is the Mean Time to restart. MNIR is the Mean Number of
Instructions to Restart.

Obviously, the figures shown in Table 1-1 do not carry much information. An MTTS figure alone
does not tell the impact of unreliability on system use. Compare for example the CRAY-1, [Russell 78],
with the CMUA, [Bell 78]. Although the CRAY-1 crashes twice as often as the CMUA, it can operate
continuously at rates above 138 Million Instructions Per Second (MIPS), while the CMUA operates at
1.2 MIPS. Hence the CMUA executes ~10^10 instructions between crashes while the CRAY-1 executes
~10^12 instructions between crashes. Inconsistencies like this one suggest that reliability modelling
and measurement should be closely related to the characterization of the performance of the system
under study. Integrated performance-reliability models have already started to appear in the
literature. In [Meyer 79], a performance measure called "performability" gives the probability that a
system performs at different levels of "accomplishment". In [Gay 79], systems are modelled with
Markov processes in order to estimate the probability of being in one of several capacity states. This
is a similar approach to the one previously taken in [Beaudry 78], where the concept of "computation
reliability" was introduced as a measure which takes into account the computation capacity of a
system in each possible operational state. Finally, a Performance/Availability model for gracefully
degrading systems with critically shared resources is given in [Chou 80].

However, most of the above models have been developed mainly for hard failures, that is, stable
failures that reflect an irreversible physical change in the hardware. Unfortunately, as has been
repeatedly  reported  ( [Fuller  78],  [McConnel  79],  [Morganti  78],  [Siewiorek  78],  [Ohm  79]),  transient 
failures  occur  at  least  an  order  of  magnitude  more  often  than  hard  failures.  A  cost  effective  analysis 
should  then  consider  transients  as  the  main  reason  for  system  unreliability. 

Simultaneously with the developments described above, qualitative relationships between
workload  and  unreliability  have  also  been  noted.  The  results  published  in  [Beaudry  79]  suggest  a 
strong  dependency  between  workload  and  reliability  of  digital  computing  systems.  And  in  the  paper 
by  [Butner  80],  this  dependency  is  stated  explicitly  claiming  that  a  periodic,  workload-dependent 
failure  rate  is  more  appropriate  to  characterize  the  reliability  of  time-sharing  systems  than  the 
classical  constant  failure  rate  model  traditionally  used.  As  reported  in  [Castillo  80],  if  such  a 
dependency  is  taken  into  account  it  is  possible  to  characterize  the  performance  of  digital  computing 
systems  considering  reliability  as  an  inherent  attribute. 


1.3 Software Reliability

The  problem  of  software  reliability  assessment  is  part  of  the  more  general  area  of  software  quality 
assessment [Mohanly 73]. Effective mechanisms for measuring software quality are required due to
the high cost of software development and maintenance. Forecasts indicate that by 1985 over 90% of
the total computing dollars spent annually will be for software [Horowitz 75]. The development of
techniques for measuring software reliability has been motivated mainly by project managers who
need both ways of estimating the manpower needed to develop a software system with a given level
of performance and techniques to detect when this level of performance has been reached. However,
most software reliability models presented to date are still far from satisfying these two needs in a
general context.

Software  reliability  models  can  be  roughly  grouped  in  four  categories.  The  first  category  would 
include  models  formulated  in  the  time  domain.  These  models  attempt  to  relate  software  reliability 
(characterized,  for  instance,  by  a  MTTF  figure  under  typical  workload  conditions)  to  the  number  of 
bugs  present  in  the  software  at  a  given  time  during  its  development.  Typical  of  this  approach  are  the 
models presented in [Shooman 73], [Musa 75], and [Jelinsky 73]. Bug removal should increase MTTF
and  correlation  of  bug  removal  history  with  the  time  evolution  of  the  MTTF  value  may  allow  the 
prediction  of  when  a  given  MTTF  value  will  be  reached.  An  example  of  the  application  of  time  domain 
models  to  the  development  of  a  real  time  system  is  given  in  [Miyamoto  75].  The  main  disadvantages 
of  time  domain  models  are  that  bug  correction  can  generate  more  bugs,  and  that  software 
unreliability  can  be  due  not  only  to  implementation  errors  (bugs)  but  also  to  design  (specification) 
errors. 

Another  approach  to  software  reliability  modeling  is  based  on  studying  the  data  domain.  The  first 
model  of  this  kind  is  described  in  [Nelson  73].  In  principle,  if  sets  of  all  input  data  values  upon  which  a 
computer program can operate are identified, an estimate of the reliability of the program can be
obtained by running the program for a subset of input data values. A more detailed description of data
domain techniques is given in [Thayer 78]. In the paper by [Schick 78] the time domain and data
domain models are compared. However, different applications will tend to use different subsets of all
possible input data values, "seeing" different reliability values for the same software system. This fact
is formally taken into account in [Cheung 80], where software reliability is estimated from a Markov
model  whose  transition  probabilities  depend  on  a  user  profile.  Techniques  for  evaluating  the 
transition  probabilities  for  a  given  profile  are  given  in  [Cheung  75]. 
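
The data domain idea, and its dependence on the user profile, can be sketched as follows. This is only an illustration under stated assumptions, not the procedure of any of the cited models: the faulty module `buggy_abs`, the reference `oracle_abs`, and the two input profiles are all hypothetical.

```python
def estimate_reliability(program, oracle, sampled_inputs):
    """Fraction of sampled inputs for which the program agrees with the oracle."""
    correct = sum(1 for x in sampled_inputs if program(x) == oracle(x))
    return correct / len(sampled_inputs)

def oracle_abs(x):
    """Reference behavior: absolute value."""
    return -x if x < 0 else x

def buggy_abs(x):
    """Hypothetical faulty module: wrong result for inputs below -40."""
    if x < -40:
        return x            # software fault: sign is not corrected
    return -x if x < 0 else x

# Two hypothetical user profiles, each a sample of the inputs that user submits.
profile_a = list(range(0, 100))      # never exercises the faulty region
profile_b = list(range(-50, 50))     # 10 of its 100 inputs hit the fault

if __name__ == "__main__":
    print(estimate_reliability(buggy_abs, oracle_abs, profile_a))
    print(estimate_reliability(buggy_abs, oracle_abs, profile_b))
```

The same module appears perfectly reliable to the first user and noticeably less reliable to the second, which is the effect a profile-dependent model is meant to capture.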

The  third  category  includes  models  in  which  software  reliability  (and  software  quality  in  general)  is 
postulated  to  obey  certain  laws  [Ferdinand  74],  [Fitzsimmons  78].  Although  such  models  have 
generated considerable interest, their general validity has never been proven and, at most, they
only  give  a  figure  for  the  number  of  bugs  present  in  a  program. 

Finally,  there  have  been  some  attempts  to  characterize  total  system  reliability  (hardware  and 
software)  in  [Costes  78],  modelling  of  fault-tolerant  software  (through  module  duplication)  in  [Hecht 
76],  and  warnings  about  how  not  to  measure  software  reliability  [Littlewood  79]. 

What all the above models have in common is that none of them characterizes system behavior
accurately enough to give the user a guaranteed level of performance under general
workload conditions. They concentrate on estimating the number of bugs present in a program but do not
give any accurate method to characterize and measure operational system unreliability due to
software. There is a wide gap between the variables that can be easily measured in a running
system and the number of bugs in its operating system. A cost-effective analysis, however, should
make it possible to evaluate the impact of software unreliability from variables easily accessible in an
operational system, without knowing the details of how the operating system has been written.
1.4 Measuring Reliability under typical and atypical conditions

The  assumption  here  is  that  reliability  is  a  performance  attribute  in  the  sense  that  a  lack  of 
reliability  increases  the  expected  value  of  the  system  response  time.  If  such  a  relationship  can  be 
derived  in  a  general  context,  policies  and/or  design  parameters  could  be  used  to  optimize  the 
ultimate  system  performance.  In  this  report,  the  approach  taken  is  that  failure  rate  time  variations 
should  closely  follow  workload  time  variations.  Intuitively,  the  dependency  between  workload  and  lack 
of  reliability  can  be  explained  quite  easily.  Assume  that  we  have  a  constant  failure  rate  for  the  primary 
memory  of  a  digital  computing  system  operating  in  a  stable  environment  under  a  time  sharing  policy. 
That  the  transient  failure  rate  in  a  memory  is  constant  is  a  reasonable  assumption.  There  is 
justification  for  thinking  that  certain  complex  devices  might  follow  an  exponential  failure  law  ( [Barlow 
65],  pp  18-22).  The  physical  characteristics  of  the  memory  IC's  do  not  change  with  time  (at  least 
during  the  effective  life  cycle  of  modern  digital  computing  systems).  We  have  to  look  then  for  the 
origin  of  these  transients  either  in  external  sources,  such  as  radiation,  the  presence  of  noise  (possibly 
impulsive)  in  the  power  supply  or  in  the  limitations  of  the  manufacturing  process.  In  fact,  it  has  been 
reported  in  [Geilhofe  79]  that  MOS  memory  devices  exhibit  non  recurring  bit  failures  caused  by  Alpha 
particles emitted from small amounts of radioactive elements present in IC packaging material. The
failure rate for this kind of failure is of course constant. Assume now that a transient memory failure
has  higher  probability  of  leading  to  a  system  crash  when  the  central  processor  is  executing  in  Kernel 
mode  than  when  it  is  executing  in  user  mode.  A  memory  failure  when  the  CPU  is  executing  in  user 
mode  may  affect  a  user  process  but  will  not  crash  the  system.  The  system  failure  rate  due  to  transient 
memory  failures  will  then  depend  on  the  ratio  of  the  number  of  memory  references  while  in  Kernel 
mode  to  the  total  number  of  memory  references  per  unit  time.  Since  it  is  a  well  known  fact  that 
operating  system  overhead  increases  with  workload,  the  previous  ratio  will  also  be  a  nondecreasing 
function  of  the  system  workload,  increasing  in  turn  the  observed  system  failure  rate.  The  result  is  that 
the  observed  system  failure  rate  due  to  transient  memory  failures  should  be  equal  to  the  sum  of  a 
component following the operating system overhead variations in time (or indirectly, workload
variations  in  time),  plus  a  constant,  workload  independent  component  (even  if  the  system  is  idle,  there 
may  still  be  memory  errors  that  corrupt,  for  instance,  the  clock  interrupt  subroutine). 
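
The argument of this paragraph can be sketched numerically: a system failure rate made of a constant, workload-independent term plus a term proportional to the fraction of memory references made in Kernel mode. The daily overhead profile, the two coefficients, and the function names below are hypothetical values chosen for illustration; they are not measurements from any system discussed in this report.

```python
import math

def kernel_fraction(t_hours):
    """Hypothetical daily profile of the Kernel-mode share of memory references.

    Periodic with a 24 hour period, peaking at 14:00 and lowest at 02:00.
    """
    return 0.15 + 0.10 * math.sin(2.0 * math.pi * (t_hours - 8.0) / 24.0)

def failure_rate(t_hours, base_rate=0.01, sensitivity=0.5):
    """Failures per hour: constant component plus overhead-following component.

    base_rate and sensitivity are assumed values, not fitted parameters.
    """
    return base_rate + sensitivity * kernel_fraction(t_hours)

def reliability(t_hours, steps=10_000):
    """R(t) = exp(-integral from 0 to t of failure_rate(s) ds), midpoint rule."""
    dt = t_hours / steps
    integral = sum(failure_rate((i + 0.5) * dt) for i in range(steps)) * dt
    return math.exp(-integral)

if __name__ == "__main__":
    print(failure_rate(2.0))     # low-load hours: rate near its minimum
    print(failure_rate(14.0))    # peak-load hours: rate near its maximum
    print(reliability(24.0))     # probability of surviving one full day
```

Over one full period the sinusoidal term integrates to zero, so the one-day reliability depends only on the mean rate, while the instantaneous rate still follows the load.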

Even  if  the  fact  that  a  computing  system  is  not  always  equally  sensitive  to  the  presence  of  hardware 
errors  is  not  considered,  there  are  still  arguments  to  support  the  idea  that  the  apparent  system  failure 
rate  should  depend  on  the  workload.  The  fact  is  that  in  most  computing  systems,  a  component  failure 
will  be  noticed  only  if  the  component  is  "exercised".  A  time  sharing  system  with  no  load,  spending 
most  of  its  time  in  a  wait  state  and  only  a  fraction  of  the  time  executing  the  clock  interrupt  routine  may 
sustain several failures and still not report any errors if the minimal hardware configuration required
to  execute  these  basic  functions  is  not  affected.  It  is  not  maintained  here  that  failures  will  be  caused 
by  increased  utilization  (although  in  some  cases  this  situation  is  certainly  possible)  but  that  they  will 
be  detected  by  an  increase  in  system  utilization.  This  effect  has  also  been  referred  to  as  "error 
latency"  [Shedletsky  73]. 
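
A toy sketch of this detection effect follows. It assumes, purely for illustration, that a component is exercised at regular intervals and that a fault is noticed at the first exercise after it occurs; the function names and the numerical intervals are hypothetical.

```python
import math

def detection_time(fault_time, exercise_interval):
    """First instant at or after fault_time at which the component is exercised.

    The component is assumed to be exercised at 0, I, 2I, ... (I = exercise_interval).
    """
    return math.ceil(fault_time / exercise_interval) * exercise_interval

def error_latency(fault_time, exercise_interval):
    """Time during which the fault is present but has not yet been observed."""
    return detection_time(fault_time, exercise_interval) - fault_time

if __name__ == "__main__":
    fault_at = 10.2  # hours, hypothetical
    print(error_latency(fault_at, 8.0))   # lightly loaded: exercised every 8 hours
    print(error_latency(fault_at, 0.5))   # heavily loaded: exercised every 30 minutes
```

The fault occurs at the same instant in both cases; only the moment at which it is observed changes with utilization.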

Analogous  arguments  lead  to  the  expectation  that  the  rate  of  system  failures  due  to  software 
unreliability  will  depend  on  how  much  the  software  is  exercised.  System  software  failures  are  due  to: 
a) the (static) input data to a program module presents some peculiarities that the program is not able
to handle or, b) the software is not capable of handling some time dependent (dynamic) sequence
in  the  input  data  stream.  In  the  case  of  a  time  sharing  system,  the  only  software  capable  of  provoking 
a  system  failure  is  the  Kernel  of  the  Operating  System.  This  software  usually  executes  in  a  privileged 
processor  state  and  a  software  error  that  corrupts  some  critical  information  in  the  Kernel  data 
structures  may  lead  to  a  system  failure.  However,  since  nobody  has  any  a  priori  knowledge  of  what 
these  errors  are,  it  is  less  likely  that  the  system  finds  one  of  these  combinations  in  its  input  stream 
under  low  load  (that  is,  small  amounts  of  input  data  to  process  per  unit  time)  than  in  a  high  load 
situation.  Again,  the  observed  system  failure  rate  has  to  depend  on  the  system  load.  Furthermore, 
under correct system operation, a user program cannot access any resource for which it has
not been given explicit permission by the kernel. Hence, it is not necessary to worry about the effects
of  user  programs.  Unfortunately,  a  mathematical  characterization  of  these  phenomena  is  not 
available.  Most  of  the  so  called  software  reliability  models  attempt,  at  most,  to  give  a  figure  for  the 
Mean  Time  To  Failure  of  a  software  system  under  some  "typical"  workload  conditions.  As  will  be  seen 
in  the  following  sections,  the  characterization  of  a  "typical"  workload  is  in  itself  an  important  problem. 

One  of  the  more  important  byproducts  of  considering  a  time  varying  failure  rate  in  which  failures 
can  be  due  either  to  hardware  transients  or  software  design  errors  is  that  the  relative  contribution  of 
software  to  system  unreliability  can  be  estimated  directly  from  the  history  of  system  failures.  From  a 
software point of view, the model presented here is more in line with the ideas presented in
[Littlewood 79] in the sense that the concepts of bug identification and elimination should be
separated from reliability measurement. No one cares about how many bugs remain in a software
system if the system operates at an acceptable level of performance. The modeling methodology
presented  in  this  report  does  not  give  any  solution  to  the  problem  of  improving  software  reliability 
(although  it  gives  some  non  trivial  hints  about  how  that  could  be  done)  but  gives  a  method  to 
characterize  the  distribution  of  the  time  to  failure  due  to  software  under  general  workload  conditions. 


The  formal  characterization  of  performance  of  a  digital  computing  system  may  be  very  elusive.  As 
described in [Ferrari 75], there are no known system-independent and workload-independent
performance indices (two properties necessary for a measure to be considered universal). But the
average user is concerned only with the time elapsed between requesting a computation and obtaining
the correct result. This time will depend on the load of the system, the operating system overhead,
the probability that the system fails and the user's task has to be restarted, and of course, on the
underlying  hardware  configuration.  It  is  then  an  important  problem  to  establish  formal  quantitative 
relationships  between  workload,  performance,  and  reliability. 

In  summary,  this  report  gives  a  solution  to  the  mathematical  characterization  of  the  relationships 
between  workload,  performance,  and  reliability  due  to  transient  failures  and  software  design  errors. 
The  mathematical  analysis  developed  here  can  be  applied  not  only  to  computing  systems,  but  to  any 
complex  systems  in  which  reliability  is  an  important  characteristic  and  for  which  some  knowledge 
about  workload  variations  is  available.  Since  a  large  class  of  these  systems  operate  under  a  quasi- 
periodic  demand  (such  as  public  transportation  systems,  power  distribution  networks,  time  sharing 
and  some  real-time  computing  systems,  etc.),  the  mathematical  characterization  has  been  developed 
first for systems in which the workload can be characterized by a cyclostationary stochastic process
(a time-varying stochastic process with periodic mean and variance).
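
A minimal sketch of such a workload process is a periodic mean plus zero-mean Gaussian noise whose standard deviation is also periodic. The 24 hour period, the shapes of the mean and standard deviation, and all function names are hypothetical choices for illustration and are not taken from the measurements reported later.

```python
import math
import random

PERIOD_HOURS = 24.0  # assumed period of the workload cycle

def mean_load(t):
    """Hypothetical periodic mean workload, in arbitrary units."""
    return 1.0 + 0.5 * math.sin(2.0 * math.pi * t / PERIOD_HOURS)

def std_load(t):
    """Hypothetical periodic standard deviation with the same period."""
    return 0.1 + 0.05 * math.cos(2.0 * math.pi * t / PERIOD_HOURS) ** 2

def sample_load(t, rng):
    """One realization of the workload at time t: periodic mean plus Gaussian noise."""
    return mean_load(t) + rng.gauss(0.0, std_load(t))

def phase_average(phase, cycles, rng):
    """Average of the samples taken at the same phase of many consecutive periods."""
    total = sum(sample_load(phase + k * PERIOD_HOURS, rng) for k in range(cycles))
    return total / cycles

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is reproducible
    print(mean_load(6.0))                  # mean at 06:00
    print(phase_average(6.0, 5000, rng))   # empirical estimate at the same phase
```

Averaging observations taken at the same phase of successive periods recovers the periodic mean, which is how the deterministic component of such a process could be estimated from recorded data.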

In  Section  2  the  formal  assumptions  made  in  the  characterization  of  the  failure  process  of  a  time- 
shared  computer  are  stated  in  detail.  Also,  general  expressions  are  derived  for  the  Probability 
Distribution  Function,  Reliability  Function,  and  Hazard  Function  of  the  times  to  hardware  failure, 
software  failure,  and  system  failure  when  the  system  overhead  is  described  by  a  cyclostationary 
process that can be approximated by a periodic function plus a "corrected" zero mean Gaussian
process. 

The results presented in Section 2 are elaborated in Section 3, where the failure process of a real system is studied in detail and exact expressions are given that characterize the software, hardware, and system reliability for that particular system. In Section 4 these results are compared with the available data regarding the reliability of the system under consideration and with the characterization that would result from more traditional models such as constant failure rate (time to failure exponentially distributed), Weibull, and periodic failure rate. Finally, in Section 5, a list of items to be further investigated is proposed, along with some preliminary conclusions. Two particularly tedious mathematical derivations have been left to appendices so as not to distract the reader with cumbersome details that are not relevant to the ideas presented in this report.
2. Mathematical characterization


This Section gives the mathematical basis of a model capable of predicting and calibrating the unreliability of digital computers due to hardware transients and software errors. First, the necessary definitions are given in Section 2.1. The assumptions that the systems to be modelled must satisfy are stated in detail in Section 2.2. In Sections 2.3 and 2.4 a mathematical skeleton is built based on these assumptions. The result is a general expression for the Probability Distribution Function (PDF) of the time to system failure. Finally, in Section 2.5, the general procedure for evaluating the maximum likelihood estimates of the model parameters is outlined.

2.1  Definitions 

A stochastic process $\{x(t,\omega);\ t \in T,\ \omega \in \Omega\}$ is a family of random variables all defined in the same probability space $\Omega$ and indexed by a real parameter $t$ that takes values in a parameter set $T$ called the index set of the process. The indexing parameter $t$ will represent time in all the processes presented in this report and $T$ will always be equal to the real line $\mathbb{R}$, that is, only continuous time processes will be considered. For each fixed $t \in \mathbb{R}$, $x(t,\omega)$ as a function of $\omega$ will be a real valued random variable. For each $\omega \in \Omega$, $x(t,\omega)$ as a function of $t$ will be a real valued function of time called a realization of the process. The set of all these time functions is called the ensemble of the process.

Definition 1: A counting process $\{N(t,\omega);\ t \ge t_0\}$ is a stochastic process having the set $\mathbb{N}_+ = \{0, 1, 2, \ldots, \infty\}$ of nonnegative integers as its state space.

For each $\omega \in \Omega$, $N(t,\omega)$ is a piecewise-constant function of $t$ with jumps at $t_1(\omega), t_2(\omega), \ldots, t_n(\omega)$, the values of $t_1, \ldots, t_n$ depending on the realization of the process. Counting processes are always associated with point processes, the value of $N(t,\omega)$ for $t_i \le t < t_{i+1}$ being the total number of "points" generated up to $t_{i+1}$. All counting processes presented in this report will be associated with failure processes of a given system, the value of $N(t,\omega)$ for $t_i \le t < t_{i+1}$ being the number of system failures detected up to $t_{i+1}$. A typical realization or sample function of a counting process is shown in Figure 2-1.

Definition 2: A Poisson process is a counting process $\{N(t);\ t \ge t_0\}$ with the following three properties:

1. $\Pr[N(t_0) = 0] = 1$.

2. For $t_0 \le s < t$, the increment $N(s,t) = N(t) - N(s)$ is Poisson distributed with parameter $\Lambda(t) - \Lambda(s)$, where $\Lambda(t)$ is a nonnegative, nondecreasing function of $t$.

3. $\{N(t);\ t \ge t_0\}$ has independent increments.

Figure 2-1: A possible sample function of a counting process.


Property 3 is the distinguishing property. It means that for a Poisson counting process, the numbers of points in nonoverlapping intervals are statistically independent random variables, no matter how large or small the intervals are and no matter how distant or close they may be. The function $\Lambda(t)$ in property 2 is termed the parameter function of the process. If $\Lambda(t)$ is an absolutely continuous function of $t$, it can be expressed as

$$\Lambda(t) = \int_{t_0}^{t} \lambda(\tau)\, d\tau \tag{2.1}$$

where $\lambda(\tau)$ is a nonnegative function of $\tau$ for $\tau \ge t_0$. The function $\lambda(t)$ is termed the intensity function of the process $N(t)$. At any time $t \ge t_0$, the intensity function $\lambda(t)$ is the instantaneous average rate at which points occur. If $N(t)$ is a failure process, $\lambda(t)$ is the failure rate of the process.

Definition 3: A Poisson process is said to be homogeneous when the intensity function $\lambda(t)$ is a constant independent of time.

Definition 4: Whenever the intensity function $\lambda(t)$ is not a constant but a deterministic function of time, the corresponding Poisson process is said to be inhomogeneous.
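As an illustration of Definition 4, the following Python sketch generates the points of an inhomogeneous Poisson process by thinning: candidate points are drawn from a homogeneous process of rate `rate_max` and each is kept with probability $\lambda(t)$/`rate_max`. Thinning is a standard simulation technique and is not part of the method of this report; the function names, the sinusoidal rate, and all numerical values are assumptions chosen only for illustration.

```python
import math
import random

def simulate_inhomogeneous_poisson(rate, rate_max, t_end, rng):
    """Return event times on [0, t_end) for an intensity rate(t) <= rate_max."""
    times = []
    t = 0.0
    while True:
        # Candidate points come from a homogeneous process of rate rate_max.
        t += rng.expovariate(rate_max)
        if t >= t_end:
            return times
        # Keep the candidate with probability rate(t) / rate_max.
        if rng.random() < rate(t) / rate_max:
            times.append(t)

def daily_rate(t):
    """Hypothetical periodic failure rate (events per hour), period 24 h."""
    return 0.5 + 0.4 * math.sin(2.0 * math.pi * t / 24.0)

rng = random.Random(1)
# 200 full periods; the expected count is the integral of the rate, 0.5 * 4800.
events = simulate_inhomogeneous_poisson(daily_rate, 0.9, 24.0 * 200, rng)
print(len(events))
```

Counting the generated points in equal time-of-day slots would show the same periodic profile as `daily_rate`, which is the behavior a deterministic, time-varying intensity is meant to capture.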

Definition 5: Let $x(t)$ be a stochastic process that is an "outside" process influencing the evolution of a counting process $\{N(t);\ t \ge t_0\}$. $N(t)$ is a doubly stochastic Poisson process with intensity process $\{\lambda(t, x(t));\ t \ge t_0\}$ if for almost every realization of the process $x(t)$, $N(t)$ is a Poisson process with intensity function $\lambda(t, x(t))$.

The process $x(t)$ carries the information about how the intensity process varies, and for this reason will also be called the information process.
Definition 6: A stationary process (in the strict sense) is a stochastic process $\{x(t),\ t \in T\}$ with the property that for any positive integer $k$ and any points $t_1, \ldots, t_k$ and $h$ in $T$, the joint distribution of

$$\{x(t_1), \ldots, x(t_k)\}$$

is the same as the joint distribution of

$$\{x(t_1 + h), \ldots, x(t_k + h)\}$$

Intuitively, a process is stationary if it has the same joint statistics regardless of where the time origin is set. Hence, if $x(t)$ is a stationary Gaussian process, the joint distribution function of $\{x(t_1 + h), \ldots, x(t_k + h)\}$ is a multivariate Gaussian distribution whose covariance matrix is independent of $h$.

Definition 7: The autocorrelation function $R_{xx}$ of a process $x(t)$ is defined as

$$R_{xx}(t_1,t_2) = E\{x(t_1)\,x(t_2)\} = \iint a_1 a_2\, p_{x(t_1),x(t_2)}(a_1,a_2)\, da_1\, da_2$$

where $E\{\cdot\}$ stands for expected value and $p_{x(t_1),x(t_2)}(a_1,a_2)$ is the joint probability density function of $x(t_1)$ and $x(t_2)$.

If $x(t)$ is stationary and real, $R_{xx}(t_1,t_2)$ depends only on the time difference $\tau = |t_1 - t_2|$ and

$$R_{xx}(\tau) = E\{x(t+\tau)\,x(t)\}$$

Definition 8: A stochastic process $x(t,\omega)$ is ergodic in the most general sense if all its statistics can be determined from a single realization $x(t,\omega_0)$ of the process.

Loosely speaking, a process is ergodic if time averages (the only ones that can be obtained from a single realization of the process) equal ensemble averages (i.e., expected values). Ergodicity can also be defined with respect to particular parameters of the process. Only ergodicity with respect to the autocorrelation function will be needed in this report, which is defined as follows:

Definition 9: A stochastic process is ergodic with respect to the autocorrelation function if

$$R_{xx}(\tau) = \lim_{T \to \infty} \frac{1}{2T} \int_{-T}^{T} x(t+\tau)\,x(t)\, dt$$

If ergodicity of the autocorrelation function is satisfied, the autocorrelation function can be estimated by computing the above integral for a finite record of a single realization of the process $x(t)$.
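The estimate from a finite record can be sketched in Python as follows. The time integral of Definition 9 is replaced by an average over equally spaced samples, which is a discretization assumed here for illustration. The function name, the synthetic first-order autoregressive record, and its coefficient are likewise assumptions; no measured data are used.

```python
import random

def estimate_autocorrelation(samples, max_lag):
    """Time-average estimate of R_xx at lags 0..max_lag (lags in samples)."""
    n = len(samples)
    estimates = []
    for lag in range(max_lag + 1):
        products = [samples[i + lag] * samples[i] for i in range(n - lag)]
        estimates.append(sum(products) / (n - lag))
    return estimates

# Hypothetical record: autoregressive noise x[i] = a * x[i-1] + e[i].
rng = random.Random(7)
a = 0.8
x = [0.0]
for _ in range(20000):
    x.append(a * x[-1] + rng.gauss(0.0, 1.0))

r = estimate_autocorrelation(x, 3)
# For this synthetic model R(lag) / R(0) should be close to a ** lag.
ratios = [value / r[0] for value in r]
print(ratios)
```

The agreement between the time averages and the known ensemble values of the synthetic model is what ergodicity with respect to the autocorrelation function asserts.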

Definition 10: A real valued, continuous time stochastic process is defined to be a cyclostationary process with period $T$ if and only if

1. $E\{x(t)\} = E\{x(t+T)\}$

2. $E\{x(t)\,x(s)\} = E\{x(t+T)\,x(s+T)\} \quad \forall\, s,t$

that is, it is a stochastic process with periodic mean and autocorrelation functions.

Definition 11: A doubly stochastic Poisson process will be said to be a cyclostationary Poisson process if its information process is cyclostationary.

In  summary,  and  as  a  short  introduction,  this  report  summarizes  the  results  obtained  by  assuming 
the  failure  processes  of  Time-Sharing  computing  systems  to  be  characterized  by  cyclostationary 
Poisson  processes. 

2.2 Basic assumptions made in the characterization of failure processes

First,  the  behavior  of  failures  in  Time-Sharing  computing  systems  will  be  characterized.  Since  the 
occurrence  of  failures  is  random,  a  necessary  requirement  to  understand  the  process  of  how  a  lack  of 
reliability  affects  the  performance  of  a  system  is  to  find  an  expression  for  the  probability  density 
function  of  the  time  to  failure. 

2.2.1  Characterization  of  the  failure  process 

The approach taken has been to assume that the failure processes of the different subsystems can be accurately modeled by cyclostationary Poisson processes. Although it is common in reliability theory
to  assume  that  failure  processes  are  properly  modeled  by  Poisson  processes,  one  may  well  wonder 
why  this  assumption  leads  to  good  results.  There  are  at  least  three  reasons  for  characterizing  failure 
processes  with  Poisson  processes. 

First,  the  conditions  for  a  Poisson  process  are  very  likely  to  be  valid  for  many  physical 
environments.  Qualitatively,  these  conditions  can  be  summarized  as  follows : 

• Two failures cannot occur simultaneously.

• At any time, there exists an instantaneous failure rate at which failures occur per unit time and such that the value of this instantaneous failure rate is independent of the past history of the system.

• The number of failures at start time is zero.

(see [Sneyder 75] for a formal proof that the above are sufficient conditions for a process to be Poisson). If this "instantaneous" failure rate is a constant, the above three conditions define a homogeneous Poisson process, for which the interarrival times are independent and exponentially distributed random variables. If the failure rate is a deterministic function of time, a nonhomogeneous Poisson process is defined. Finally, if the failure rate is another stochastic process, the above three conditions define a doubly stochastic Poisson process.

Figure 2-2: Average number of blocks accessed in the file system as a function of time of day.

Figure 2-3: Disk failures as a function of time of day.

The second reason for using a Poisson process is that whenever a point process is the result of pooling the points of many independent point processes (whatever their characterization may be), and the component processes are sufficiently sparse, the pooled process converges to a Poisson process [Cinlar 72]. This is certainly the case for modern digital computing systems. The complexity of a minicomputer like the PDP-11/40 [Bell 78b] in a minimal configuration of 64 Kbytes of memory, clock, and a terminal interface is on the order of $10^3$ IC packages. For a supercomputer like the CRAY-1 [Russell 78], the complexity is on the order of $10^5$ IC packages. The average Mean Time To Failure (MTTF) per component is on the order of $10^7$ hours ($\sim 10^3$ years) for hard failures [Hodges 77]. Hence, the system failure rate due to transients is the superposition of $\sim 10^3$ failure processes,
and the probability of observing a failure of any one of the component processes in a meaningful time interval is very small (of the order of $10^{-4}$ for a one-month interval). The fact that the superposition of sparse point
processes  converges  to  a  Poisson  process  guarantees  that,  independently  of  the  characterization  of 
each of the component processes, the system failure process will be very close to a (not necessarily
homogeneous)  Poisson  process. 
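The pooling argument can be checked numerically with a small Python sketch. Each hypothetical component is a renewal process whose gaps are uniformly distributed, hence clearly not exponential (coefficient of variation about 0.58). If the pooled process is close to Poisson, the coefficient of variation of the pooled gaps should be close to 1. The number of components, the gap distribution, and the time horizon are assumptions chosen for illustration and do not describe any of the systems discussed in this report.

```python
import random
import statistics

def pooled_event_times(n_components, mean_gap, t_start, t_end, rng):
    """Pool the events of independent renewal processes with uniform gaps."""
    pooled = []
    for _ in range(n_components):
        t = rng.uniform(0.0, 2.0 * mean_gap)  # random initial phase
        while t < t_end:
            if t >= t_start:  # discard the start-up interval
                pooled.append(t)
            t += rng.uniform(0.0, 2.0 * mean_gap)  # gap is not exponential
    pooled.sort()
    return pooled

rng = random.Random(3)
# 200 sparse components, each with a mean gap of 200 time units.
times = pooled_event_times(200, 200.0, 1000.0, 5000.0, rng)
gaps = [later - earlier for earlier, later in zip(times, times[1:])]
cv = statistics.pstdev(gaps) / statistics.mean(gaps)
print(cv)
```

A coefficient of variation near 1 is only a necessary sign of exponentially distributed gaps, so this is a plausibility check rather than a proof of convergence.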

Finally  and  most  importantly,  even  if  system  characterization  by  means  of  Poisson  processes  is  only 
approximate,  these  processes  are  very  well  understood  and  fairly  complex  mathematical  tools  exist 


Figure 2-4: Number of blocks accessed per unit time in a file system during five consecutive weekdays (millions of blocks accessed per 5 minutes).

Figure 2-5: Average fraction of time in kernel mode, k(t), as a function of time of day.

Figure 2-6: System failures (restarts) as a function of time of day.
That a doubly stochastic Poisson process should be used (that is, that the failure rate should itself be a stochastic process) is suggested by the data presented in Figures 2-2 through 2-6. Figure 2-4 shows the number of blocks read from and written to the file system of a time-sharing digital computing system during five consecutive weekdays. There is a clear (although nondeterministic) periodicity in these data. Note, for instance, that there is always a peak just after a new day starts. This peak is due to the backup of disks to magnetic tape, which the operator does daily after midnight. If we average the data for these five days and plot the profile of average disk use over one day, we see that there is a common time-varying pattern for all days (Figure 2-2). If we now examine the one-day profile of disk failures detected during the same period in Figure 2-3, we note a remarkable similarity between the two plots. Although different, the plots in Figures 2-2 and 2-3 present their main peaks and valleys at approximately the same times of day. It seems that in the long run, after averaging over a one-day period, both the failure rate and the system usage variables show the same temporal behavior. If such a dependency exists instantaneously, that is, if the failure rate at a given time depends on the system load at that time, the failure rate must be characterized as a stochastic process, since the load variations presented in Figure 2-4 cannot be considered deterministic.
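The averaging step described above (folding a multi-day record into a one-day profile) can be sketched in Python as follows. The function name and the small load record are hypothetical and serve only to show the computation; they are not the measurements plotted in the figures.

```python
def daily_profile(samples, samples_per_day):
    """Average the samples taken at the same time of day across whole days."""
    n_days = len(samples) // samples_per_day
    profile = []
    for slot in range(samples_per_day):
        values = [samples[day * samples_per_day + slot] for day in range(n_days)]
        profile.append(sum(values) / n_days)
    return profile

# Hypothetical load record: 3 days, 4 samples per day.
load = [1.0, 5.0, 9.0, 3.0,
        2.0, 6.0, 8.0, 2.0,
        3.0, 4.0, 10.0, 4.0]
profile = daily_profile(load, 4)
print(profile)
```

Applying the same folding to the failure times (counting failures per time-of-day slot) gives the second profile against which the usage profile is compared.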

Figures 2-5 and 2-6 show the average fraction of time in kernel mode for a Time-Sharing system and the number of crashes detected during 29 days, both plotted as a function of time of day. Again, there is some similarity between the two plots. The fraction of time in kernel mode for a Time-Sharing system during five consecutive days, shown in Figure 2-7, suggests a cyclostationary process.

Figures 2-2 through 2-6 should be enough evidence to justify an experiment based on the assumption that the failure rate is a stochastic process. Let $\lambda(t)$ be the value of the instantaneous failure rate at time $t$. For a doubly stochastic Poisson process, the probability density function of the time between failures conditioned on a realization of the process $\lambda(t)$ is given by [Sneyder 75]

$$p\big(t \mid \lambda(\tau),\ t_0 \le \tau \le t\big) = \lambda(t)\, e^{-\int_{t_0}^{t} \lambda(\tau)\, d\tau} \tag{2.2}$$


2.2.2  Failure  rate  characterization 

Based  on  the  arguments  in  Section  1,  it  will  be  assumed  that  the  instantaneous  value  of  the  failure 
rate  for  a  particular  resource  is  a  nondecreasing  function  of  the  "utilization"  of  that  resource.  For 
instance,  more  failures  per  unit  time  will  be  detected  in  a  file  system  when  the  number  of  blocks  read 
and  written  to  the  disks  per  unit  time  is  near  its  maximum  value  than  when  it  is  used  only  occasionally. 
The  fact  that  system  crashes  occur  more  often  in  periods  of  high  load  has  been  noted  in  [Butner  80]. 
Figure 2-7: Fraction of time in kernel mode, k(t), during five consecutive weekdays.

The exact nature of the functions relating resource utilization and failure rates may be complex, different for each resource, and difficult to characterize from observed data. Since no previous experience has been reported of working under these assumptions, a cautious approach will be taken and, as a first step, only linear relationships will be considered. In general then, the failure rate $\lambda(t)$ of a particular resource whose use is characterized by a function $u(t)$ will be given by

$$\lambda(t) = a\,u(t) + b \tag{2.3}$$

where $u(t)$ will be a function such as the ones shown in Figures 2-4 or 2-7. For instance, the failure rate of a file system $\lambda_{dk}(t)$ will be given by

$$\lambda_{dk}(t) = s_{dk}\, b(t) + c_{dk} \tag{2.4}$$

where $b(t)$ is equal to the sum of blocks read and written to the file system per unit time as shown in Figure 2-4, $s_{dk}$ is a sensitivity coefficient relating disk usage to failure rate, and the offset term $c_{dk}$ should take care of any possible drift in the relation between usage and failure rate.
The system failure rate, that is, the rate at which the system crashes and has to be restarted from scratch, is not so obviously characterized. The protection mechanisms provided by state-of-the-art operating systems and computer architectures try to maintain continued system operation regardless of individual component or subsystem failures. In most computers the CPU executes in one of several processing modes, each mode having different privileges with respect to overall system control. A system crash due to a hardware transient is only possible when the transient affects the operation of the system in the most privileged mode, the only one capable of halting the entire system or of entering an infinite loop with no other entity capable of correcting the situation. This most privileged mode of operation is usually referred to as the kernel, and the system failure rate should be a nondecreasing function of the fraction of time that the system operates in kernel mode, that is,

$$\lambda_{hw}(t) = s_{hw}\,k(t) + c_{hw} \tag{2.5}$$

where $\lambda_{hw}(t)$ is the system failure rate due to hardware transients, $s_{hw}$ is a sensitivity coefficient, $k(t)$ is the instantaneous value of the fraction of time that the system operates in kernel mode, and $c_{hw}$ is a residual, workload-independent failure rate (even if the kernel is only slightly exercised, there is the possibility that a transient in the main memory will corrupt parts of the kernel data structures).

The system failure rate due to software errors will also depend on the fraction of time that the system operates in kernel mode, because the kernel of the operating system is the only software capable of leading to a system crash. However, when the workload is very low and the kernel executes only relatively simple operations, it is to be expected that this part of the kernel will be well debugged, so that the software failure rate is zero for low values of $k(t)$. It will then be assumed that the software failure rate is zero for values of $k(t)$ below a threshold value $k_0$ and increases with $k(t)$ above $k_0$. Again, the relationship between $k(t)$ and failure rate will be assumed linear, such that

$$\lambda_{sw}(t) = \begin{cases} s_{sw}\,k(t) - s_{sw}k_0 & \text{if } k(t) > k_0 \\ 0 & \text{otherwise} \end{cases} \tag{2.6}$$

where $\lambda_{sw}(t)$ is the system failure rate due to software errors, $s_{sw}$ is its sensitivity coefficient, and $k_0$ is the value of $k(t)$ below which this failure rate is zero. $\lambda_{sw}(t)$ can be rewritten as
$$\lambda_{sw}(t) = s_{sw}\,[\,k(t) - \bar{k}(t)\,] - s_{sw}k_0 \tag{2.7}$$

where

$$\bar{k}(t) = \begin{cases} k(t) - k_0 & \text{if } k(t) < k_0 \\ 0 & \text{otherwise} \end{cases} \tag{2.8}$$

The following expression can then be obtained for the system failure rate, that is, the rate at which the system crashes due either to hardware transients or to software errors:

$$\lambda_{sy}(t) = [\,s_{hw} + s_{sw}\,]\,k(t) + c_{hw} - s_{sw}\,\bar{k}(t) - s_{sw}k_0 \tag{2.9}$$

Only these three cases (the failure process of a file system, the system failure process due to transients, and the system failure process due to software errors) will be studied in this report. Expressions for the probability density function of the time to failure and the reliability function for these cases are given in Sections 2.3 and 2.4.
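The three rate models can be written down directly, which also gives a simple check that the rewritten system rate is the sum of the hardware and software rates. The Python sketch below assumes the auxiliary function takes the value $k(t) - k_0$ below the threshold and zero elsewhere, which is the form that makes the rewritten software rate vanish below $k_0$. The function names and all numerical values are hypothetical.

```python
def hardware_rate(k, s_hw, c_hw):
    """Hardware-transient rate: linear in the kernel-mode fraction k."""
    return s_hw * k + c_hw

def software_rate(k, s_sw, k0):
    """Software rate: zero below the threshold k0, linear above it."""
    return s_sw * (k - k0) if k > k0 else 0.0

def k_bar(k, k0):
    """Auxiliary function (assumed form): k - k0 below the threshold, else 0."""
    return k - k0 if k < k0 else 0.0

def system_rate(k, s_hw, c_hw, s_sw, k0):
    """System rate written with the auxiliary function."""
    return (s_hw + s_sw) * k + c_hw - s_sw * k_bar(k, k0) - s_sw * k0

# Hypothetical coefficients, for illustration only.
S_HW, C_HW, S_SW, K0 = 0.02, 0.001, 0.05, 0.15
for k in (0.05, 0.15, 0.40):
    print(k, system_rate(k, S_HW, C_HW, S_SW, K0))
```

Writing the threshold behavior through an auxiliary function keeps the system rate linear in $k(t)$ plus a correction term, which is the structure used later when expectations are taken.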

2.2.3  Workload  characterization 

Something more can be said about the "utilization" functions. Although they are nonstationary processes, the operational policies that regulate the use of Time-Sharing systems give them a periodic behavior. The second hypothesis that we make is that workload, and hence system usage, for time-sharing systems can be modeled as a cyclostationary process [Gardner 75], [Gardner 78]. A cyclostationary process is defined as a second order process with periodic mean and autocorrelation function. The periodicity of the mean is obvious from Figure 2-4, and in fact it is possible to make the simplifying assumption that the workload causing such overhead can be described by a periodic (hence deterministic) function of time. This is the approach taken in [Butner 80], where it is expected that a periodic failure rate Poisson process will lead to a more accurate failure process characterization than a homogeneous Poisson process model (time to failure exponentially distributed). Here, the instantaneous value of the failure rate will be considered a random variable with periodic mean, and the failure rate will be a cyclostationary process. The third hypothesis is that $u(t)$ (the usage function of a particular system resource) can be properly modeled by adding a deterministic, periodic function of time $m(t)$ and a stationary, zero mean, Gaussian process. That is,

$$u(t) = m(t) + z(t) \tag{2.10}$$

such that in general

$$\lambda(t) = a\,m(t) + a\,z(t) + b \tag{2.11}$$

where $m(t)$ is a periodic, deterministic function of time and $z(t)$ is a stationary, zero mean, Gaussian process, independent of $m(t)$. This third hypothesis, although attractive, cannot be strictly correct. If $z(t)$ is a purely Gaussian process, there is a non-zero probability that $\lambda(t) < 0$, and the above expression cannot be used as the failure rate of a Poisson process. To avoid this problem let


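$$\bar{z}(t) = \begin{cases} m_{\min} + z(t) - u_{\min} & \text{if } z(t) < u_{\min} - m_{\min} \\ 0 & \text{otherwise} \end{cases} \tag{2.12}$$

and set

$$u(t) = m(t) + z(t) - \bar{z}(t) \tag{2.13}$$

from where we obtain

$$\lambda(t) = a\,[\,m(t) + z(t) - \bar{z}(t)\,] + b \tag{2.14}$$

$$\lambda(t) = a\,m(t) + b + a\,[\,z(t) - \bar{z}(t)\,] \tag{2.15}$$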
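The effect of the correction term can be seen in a short Python sketch. It assumes that the correction is applied whenever $z(t) < u_{\min} - m_{\min}$, where $m_{\min}$ is the minimum of the periodic component and $u_{\min}$ the minimum admissible usage; under that assumption the corrected usage never falls below $u_{\min}$. The periodic mean, the noise level, and the use of uncorrelated noise samples are simplifications introduced here for illustration.

```python
import math
import random

def corrected_usage(m_t, z_t, m_min, u_min):
    """Return u = m + z - z_bar, with z_bar active only for low values of z."""
    if z_t < u_min - m_min:
        z_bar = m_min + z_t - u_min  # negative by construction
    else:
        z_bar = 0.0
    return m_t + z_t - z_bar

rng = random.Random(11)
m_min, u_min = 0.2, 0.05
usage = []
for step in range(1000):
    # Hypothetical periodic mean whose minimum is m_min.
    m_t = 0.5 + 0.3 * math.sin(2.0 * math.pi * step / 100.0)
    # Uncorrelated Gaussian samples, used here only for simplicity.
    z_t = rng.gauss(0.0, 0.2)
    usage.append(corrected_usage(m_t, z_t, m_min, u_min))

print(min(usage))
```

With a nonnegative sensitivity $a$ and an offset $b$ chosen so that $a\,u_{\min} + b \ge 0$, a usage bounded below in this way yields a failure rate that is never negative.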


In summary, the three hypotheses on which this work is based are:

1. The failure process of a digital computing system can be described by a doubly stochastic Poisson process.

2. The failure rate is a linear function of the operating system overhead (and so, indirectly, depends on the system workload).

3. For computing systems with cyclostationary workload, system overhead time variations can be modeled as a periodic, deterministic function of time plus a stationary, zero mean Gaussian process, independent of the underlying periodic function and adequately corrected in order to have a positive failure rate.

Note  that  assumption  one  is  much  less  restrictive  than  the  usual  assumption  of  considering  the  time 
between  failures  being  exponentially  distributed  (i.e.,  the  failure  process  is  usually  considered  a 
homogeneous  Poisson  process).  In  later  Sections,  the  insight  gained  in  understanding  system 
behavior from dropping this oversimplification will be discussed. Also, the implications of considering
(or  not  considering)  assumptions  2  and  3  will  be  discussed. 


2.3  Characterization  of  a  file  system  failure  process 

As a first application of the hypotheses described above, the failure process of a file system under cyclostationary workload will be studied in detail. The hypotheses are that the subsystem failure rate is given by

$$\lambda_{dk}(t) = s_{dk}\, b(t) + c_{dk} \tag{2.16}$$

where

$$b(t) = m_{dk}(t) + z_{dk}(t) - \bar{z}_{dk}(t) \tag{2.17}$$

The pdf of the time to failure conditioned on a realization of the process $\lambda_{dk}(t)$ is given by

$$p_{dk}\big(t \mid \lambda_{dk}(\tau),\ t_0 \le \tau \le t\big) = \lambda_{dk}(t)\, e^{-\int_{t_0}^{t} \lambda_{dk}(\tau)\, d\tau} \tag{2.18}$$

The general pdf is given by

$$p_{dk}(t) = E\Big\{ \lambda_{dk}(t)\, e^{-\int_{t_0}^{t} \lambda_{dk}(\tau)\, d\tau} \Big\} \tag{2.19}$$

where the expectation is taken over the ensemble of realizations of the process $\lambda_{dk}(t)$. It is shown in Appendix I that, under the assumption that $\lambda_{dk}(t)$ is given by (2.14), the above expectation is equal to

$$p_{dk}(t) = -\frac{d}{dt}\Big\{ \theta_{dk}(t)\, e^{\frac{s_{dk}^2}{2}\sigma^2(t) - [\,s_{dk}\bar{m}_{dk} + c_{dk} - P_{dk}(s_{dk}, b_{\min})\,]\,t} \Big\} \tag{2.20}$$

The meanings and values of each of the parameters on which $p_{dk}(t)$ depends are described in detail in Appendix I, and only a summary will be given here. $\theta_{dk}(t)$ is a periodic function of time, depending on the periodic component of $\lambda_{dk}(t)$. The first term in the exponent contains $\sigma^2(t)$, the variance of the integral of $z_{dk}(t)$, and depends on the autocorrelation function of $z_{dk}(t)$, $R_{zz}(\tau)$. The last term depends on the mean value of the deterministic part of $\lambda_{dk}(t)$, and the correction factor $P_{dk}(s_{dk}, b_{\min})$ takes care of the contribution of $\bar{z}_{dk}(t)$. Finally, it should be noted that this expression is only valid when the second derivative of the autocorrelation function of $z_{dk}(t)$ at the origin is finite.
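The variance of the integral of a stationary process follows from its autocorrelation function through the standard identity $\sigma^2(t) = 2\int_0^t (t-\tau)\,R_{zz}(\tau)\,d\tau$, and for a zero mean Gaussian variable $Z$ with variance $\sigma^2$ one has $E\{e^{-sZ}\} = e^{s^2\sigma^2/2}$, which is where a term of the form $s^2\sigma^2(t)/2$ in the exponent comes from. The Python sketch below evaluates $\sigma^2(t)$ numerically with the midpoint rule. The Gaussian-shaped autocorrelation, its parameters, and the function names are assumptions chosen for illustration (it has a finite second derivative at the origin); this is not the derivation of Appendix I.

```python
import math

def integral_variance(autocorr, t, steps=2000):
    """Variance of the integral of z over [0, t] by the midpoint rule."""
    h = t / steps
    total = 0.0
    for i in range(steps):
        tau = (i + 0.5) * h
        total += (t - tau) * autocorr(tau)
    return 2.0 * total * h

def gaussian_autocorr(tau, variance=1.0, scale=2.0):
    """Hypothetical autocorrelation, smooth at the origin."""
    return variance * math.exp(-(tau / scale) ** 2)

short = integral_variance(gaussian_autocorr, 0.01)   # behaves like R(0) * t**2
long_run = integral_variance(gaussian_autocorr, 100.0)  # grows linearly in t
print(short, long_run)
```

For short intervals the variance grows as $R_{zz}(0)\,t^2$ and for intervals much longer than the correlation time it grows linearly in $t$, so the relative weight of the variance term in the exponent changes with the time scale considered.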

The following expression can be obtained for the Probability Distribution Function of the time between errors:

$$P_{dk}(t < T) = \int_{0}^{T} p_{dk}(t)\, dt \tag{2.21}$$

$$= \theta_{dk}(0)\, e^{\frac{s_{dk}^2}{2}\sigma^2(0)} - \theta_{dk}(T)\, e^{\frac{s_{dk}^2}{2}\sigma^2(T) - [\,s_{dk}\bar{m}_{dk} + c_{dk} - P_{dk}(s_{dk}, b_{\min})\,]\,T} \tag{2.22}$$

$$= 1 - \theta_{dk}(T)\, e^{\frac{s_{dk}^2}{2}\sigma^2(T) - [\,s_{dk}\bar{m}_{dk} + c_{dk} - P_{dk}(s_{dk}, b_{\min})\,]\,T} \tag{2.23}$$

To compare our model with a real system we still need to estimate the parameters $s_{dk}$ and $c_{dk}$ from observed data and obtain analytical expressions for the autocorrelation function $R_{zz}(\tau)$ and the variance $\sigma^2(t)$ in equation (2.23). The general problem of parameter estimation for doubly stochastic Poisson processes is described in Section 2.5, and a numerical procedure for estimating $s_{dk}$ is given in section .


2.4  The  system  failure  process 

The expression for the system failure rate due to hardware transients and software errors has been given in (2.9), where $k(t)$ is the fraction of time that the system operates in kernel mode. With the hypothesis that

$$k(t) = m_{sy}(t) + z_{sy}(t) - \bar{z}_{sy}(t) \tag{2.24}$$

(2.9) can be rewritten as

$$\lambda_{sy}(t) = [\,s_{sw} + s_{hw}\,]\,[\,m_{sy}(t) + z_{sy}(t) - \bar{z}_{sy}(t)\,] + c_{hw} - s_{sw}\,\bar{k}_{sy}(t) - s_{sw}k_0 \tag{2.25}$$

where

$$\bar{k}_{sy}(t) = \begin{cases} m_{\min} + z_{sy}(t) - k_0 & \text{if } z_{sy}(t) < k_0 - m_{\min} \\ 0 & \text{otherwise} \end{cases}$$

An additional assumption has been made here, that $k_{\min} < k_0 < m_{\min}$. That is, it is assumed that the value at which the failure rate due to software failures starts being nonzero ($k_0$) lies somewhere between the minimum value of the periodic component of $k(t)$ ($m_{\min}$) and the minimum value of $k(t)$ ($k_{\min}$). The reason for this assumption is that only in this case can a closed form expression be found for the pdf of the time to system failure. Whether this assumption holds in a real system is checked later in the report.


Again, the pdf of the time to system failure is given by

$$p_{sy}(t) = E\Big\{ \lambda_{sy}(t)\, e^{-\int_{t_0}^{t} \lambda_{sy}(\tau)\, d\tau} \Big\} \tag{2.26}$$


Using the results of Appendix I, the following expression is obtained

$$p_{sy}(t) = -\frac{d}{dt}\Big\{ \theta_{sy}(t)\, e^{\frac{(s_{sw}+s_{hw})^2}{2}\sigma^2(t) - [\,(s_{sw}+s_{hw})\bar{m}_{sy} + c_{hw} - P_{sy}(s_{sw}+s_{hw}, k_{\min}) - P_{sy}(s_{sw}, k_0)\,]\,t} \Big\} \tag{2.27}$$

and the following expression is obtained for the PDF of the time to system failure

$$P_{sy}(t < T) = 1 - \theta_{sy}(T)\, e^{\frac{(s_{sw}+s_{hw})^2}{2}\sigma^2(T) - [\,(s_{sw}+s_{hw})\bar{m}_{sy} + c_{hw} - P_{sy}(s_{sw}+s_{hw}, k_{\min}) - P_{sy}(s_{sw}, k_0)\,]\,T} \tag{2.28}$$

Again, to completely characterize the failure process of a real system, the values of $s_{sw}$, $s_{hw}$, $c_{hw}$, and $k_0$ need to be estimated from the history of failures of the system.


2.5  Parameter  estimation 

The  general  problem  of  parameter  estimation  for  doubly  stochastic  Poisson  processes  can  be 
stated  as  follows.  Let  {N(t);t>t0}  be  a  doubly  stochastic  Poisson  counting  process  with  intensity 

A(t,z(t),x).  where  z(t)  is  an  stochastic  process  and  x  *  (xyx2 . ,xm)  is  a  vector  of  unknown 

parameters.  The  occurrence  density  function  that  a  given  realization  of  the  process  has  a  failure  at 
time  tf  if  it  has  been  started  at  time  &  is,  given  by 

t * 

.  •/  X(f,Z(T).x]  dT 

p(tf|x,z(T),ts<T<tf)  *  A(tf,z(tf),x)  e  4  (2.29) 


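If we observe $n$ failures at times $t_{f1}, \ldots, t_{fn}$ with associated starting times $t_{s1}, \ldots, t_{sn}$, the probability density function of observing such a set of events is

$$p^{(n)}(t_{f1},\ldots,t_{fn} \mid x;\, z(\tau),\, t_{si}<\tau<t_{fi}\ \forall i) = \prod_{i=1}^{n} P(t_{si})\, \lambda(t_{fi}, z(t_{fi}), x)\, e^{-\int_{t_{si}}^{t_{fi}} \lambda(\tau, z(\tau), x)\,d\tau} \qquad (2.30)$$

where $P(t_{si})$ is the a priori probability that the system is started at time $t_{si}$. Taking the expectation with respect to the statistics of $z(t)$ we obtain

$$p^{(n)}(t_{f1},\ldots,t_{fn} \mid x, t_{s1},\ldots,t_{sn}) = E\left\{ \prod_{i=1}^{n} P(t_{si})\, \lambda(t_{fi}, z(t_{fi}), x)\, e^{-\int_{t_{si}}^{t_{fi}} \lambda(\tau, z(\tau), x)\,d\tau} \right\} \qquad (2.31)$$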


The maximum likelihood estimate $x' = (x'_1, x'_2, \ldots, x'_m)$ of $x$ in terms of a particular realization of the process is by definition the value of $x$ that maximizes the above density function [Melsa 78]. That is, $p^{(n)}(t_{f1},\ldots,t_{fn} \mid x, t_{si},\, i = 1,\ldots,n)$ will be maximum for $x = x'$. In the cases presented in this report, closed form expressions have been obtained for the pdf of the time to failure. They are all of the form

$$p(t_f) = h(t_f, x)\, e^{-H(t_f, t_s, x)} \qquad (2.32)$$

The function to be maximized is then

$$p^{(n)}(t_{f1},\ldots,t_{fn}) = \prod_{i=1}^{n} P(t_{si})\, h(t_{fi}, x)\, e^{-H(t_{fi}, t_{si}, x)} \qquad (2.33)$$


Note that this problem is equivalent to minimizing the function

$$l(x) = \sum_{i=1}^{n} H(t_{fi}, t_{si}, x) - \sum_{i=1}^{n} \ln[h(t_{fi}, x)] \qquad (2.34)$$

subject to the constraints

$$h(t_{fi}, x) > 0, \quad i = 1, \ldots, n \qquad (2.35)$$


Since closed form expressions for the components of $x$ at the minimum are not generally available, this is a typical nonlinear programming problem, subject to nonlinear inequality constraints. Since this problem has to be solved every time the failure process of a resource is modeled for a real system, particular care has been taken in finding an efficient procedure for locating the minima of functions of the type (2.34). In Appendix II, this procedure is described, along with detailed descriptions of all the functions for which it has been used in the evaluation of maximum likelihood parameters.
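As an illustration of the objective (2.34) and the constraint (2.35), the following minimal Python sketch evaluates $l(x)$ for a list of (start time, failure time) pairs and minimizes it for a single parameter. It is not the procedure of Appendix II: the golden-section search, the function names, and the failure history are assumptions made only for this example, and the constant-rate case $h = x$, $H = x\,(t_f - t_s)$ is chosen because its minimum is known in closed form.

```python
import math


def neg_log_likelihood(x, failures, h, H):
    """l(x) of (2.34): sum of H(tf, ts, x) minus sum of ln h(tf, x)."""
    total = 0.0
    for ts, tf in failures:
        rate = h(tf, x)
        if rate <= 0.0:  # constraint (2.35) violated: reject this x
            return math.inf
        total += H(tf, ts, x) - math.log(rate)
    return total


def golden_section_min(f, lo, hi, tol=1e-9):
    """Minimize a unimodal function of one variable on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return (a + b) / 2.0


# Hypothetical failure history: (start time, failure time) pairs, in hours.
failures = [(0.0, 2.0), (2.0, 3.5), (3.5, 9.0), (9.0, 10.0)]


def h_const(tf, x):
    """Constant hazard: h(tf, x) = x."""
    return x


def H_const(tf, ts, x):
    """Integrated constant hazard: H(tf, ts, x) = x * (tf - ts)."""
    return x * (tf - ts)


x_hat = golden_section_min(
    lambda x: neg_log_likelihood(x, failures, h_const, H_const), 1e-6, 10.0
)
```

For this constant-rate case the minimizer is the number of failures divided by the total observed time, which gives a simple check on the search. The multi-parameter hazard functions of this report would need a multidimensional minimizer in place of `golden_section_min`.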


2.6  The  implications  of  a  workload  dependent  model  in 
software  reliability  evaluation 

A  general  methodology  for  characterizing  system  reliability  in  terms  of  resource  utilization  functions 
has  been  presented  in  the  previous  sections.  First,  the  "typical"  measuring  conditions  have  been 
generalized  to  a  situation  in  which  workload  patterns  are  mapped  into  resource  utilization  functions, 
modeled  by  cyclostationary  processes.  Second,  by  considering  the  kernel  of  the  operating  system  as 
a  system  resource,  an  integrated  hardware/software  reliability  model  has  been  built.  The  assumption 
is  that  the  system  failure  rates  due  to  hardware  transients  and  software  errors  depend  on  the  kernel 
utilization process. Third, since the functional dependencies of the failure rates due to hardware
transients and software errors with respect to kernel utilization are different, it is in principle possible
to evaluate the relative contribution of each failure rate to the unreliability of the total system.

Once the general functional dependency between failure rate and kernel utilization has been established, all that is needed to completely characterize a real system is to evaluate the maximum likelihood values of the function parameters. But all that is needed to evaluate the maximum likelihood values of these parameters is a history of system failures. Hence, the contribution of software to system unreliability can be evaluated just by knowing the times of a set of system failures, without needing any information about how the kernel has been written, let alone how many bugs remain in it.


3. Failure process analysis of a real system


In order to verify that the assumptions stated in Section 2 lead to a better modeling of failure processes than other models, an experiment was designed. The experiment consisted of the acquisition of data concerning both the failure process and the use of a general purpose time sharing system. The system chosen was the CMU-A, a PDP-10 used by the Computer Science Department at Carnegie-Mellon University as the main general purpose computational tool. The system consists of a KL-10 processor, one megaword of memory, eight disk drives totalling 1600 megabytes of online storage and two magnetic tape drives. The system runs a slightly modified version of the standard TOPS-10 operating system [Bell 78a].

The software packages used to instrument the experiment are illustrated in Figure 3-1. Information about failures is obtained from an online error log file maintained by a system program, which records the information produced by different error formatting routines. Entries are made to this file for each hardware error detected in the system, for system reloads, for disk performance statistics, and so on [Digital 78]. The error log is later processed by SEADS, a FORTRAN package which lists the times of detection of errors associated with a particular resource. In order to obtain accurate information about the use of the system, a special SAIL program, SYSMON, was written that samples the values of 30 system parameters twice every five minutes, the two samples in a five minute interval being one second apart. In this way, I/O traffic, system overhead values, etc., can be obtained averaged over a one second interval or over a five minute interval, with a resolution of 5 minutes. The files generated by SYSMON are later processed by another SAIL package, READSY, which computes the periodic component and autocorrelation function of the utilization function of a particular system resource. The information generated by SEADS and READSY is then processed by an APL package (POWELL) which estimates the maximum likelihood parameters of the pdf of the time to failure of a particular resource. Finally, in a separate SAIL package, C2TST, the values predicted by the cyclostationary model and other models described in Section 4 are compared with the information stored in the error log according to a $\chi^2$ goodness-of-fit test.

The operational policies regulating the use of this system at CMU make it a good starting point to check the validity of the ideas presented in Section 2. Its steady state operation during weekdays can be understood from Figures 2-4 and 2-7. Recall that this figure plots the sampled values of the fraction of time considered to be operating system overhead for five consecutive weekdays. The value of the accumulated overhead time is obtained by executing a Monitor Call and includes the time spent in clock queue processing, short command processing, swapping and scheduling decisions, and software context switching [Digital 77]. This value does not include Monitor Call execution or I/O interrupt times. It is not exactly the time that the system is executing in kernel mode, but it is close enough for our purposes.

Figure 3-1: Software packages used in the validation of the cyclostationary modeling methodology.

3.1 Probability Distribution Function of the Time to Failure of a File System

Figure 2-2 shows the results of compiling five days of disk utilization samples into a single 24 hour period. Along with the estimated average, this figure shows the function $m_{dk}(t)$ obtained from a finite Fourier series expansion (see Appendix I for details). A Fourier series expansion is a least squares fit to our data and is a good way of eliminating the "noise" present in the estimated average due to the finiteness of the sample. The data in Figure 2-2 corresponds in fact to the function $m_{dk}(t)$ in Section 2.3, after sampling its values every five minutes. After subtracting the value of $m_{dk}(t)$ from $b(t)$, the sampled values of the process $z_{dk}(t)$ are available for estimation of its autocorrelation function.
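The two estimation steps described above (a periodic mean over one day, then the autocorrelation of what remains) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the periodic component is the raw per-slot average rather than the finite Fourier series fit used in the report, and the autocorrelation is the simple biased estimator with lags measured in sampling intervals.

```python
def periodic_mean(samples_by_day):
    """Average the samples taken at the same time of day across all days."""
    n_days = len(samples_by_day)
    n_slots = len(samples_by_day[0])
    return [sum(day[k] for day in samples_by_day) / n_days
            for k in range(n_slots)]


def residual(samples_by_day, mean):
    """Subtract the periodic mean and concatenate the days into one series."""
    return [day[k] - mean[k]
            for day in samples_by_day
            for k in range(len(mean))]


def autocorrelation(z, max_lag):
    """Biased estimate R(j) = (1/N) * sum_i z[i] * z[i + j], j = 0..max_lag."""
    n = len(z)
    return [sum(z[i] * z[i + j] for i in range(n - j)) / n
            for j in range(max_lag + 1)]


# Hypothetical utilization samples: two days, four samples per day.
days = [[0.1, 0.3, 0.5, 0.3],
        [0.3, 0.5, 0.7, 0.5]]
m = periodic_mean(days)
z = residual(days, m)
r = autocorrelation(z, max_lag=2)
```

With real data each day would hold 288 five-minute samples, and the lag index would be multiplied by the sampling interval to obtain the time axis of the estimated autocorrelation function.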


Figure  3-2:  Estimated  and  approximated  Autocorrelation 
functions  of  the  file  system  utilization  process. 

Figure 3-2 shows the estimated autocorrelation function for the process $z_{dk}(t)$. From its appearance, it seems that an autocorrelation function of the form

$$R_z(\tau) = \alpha_1 e^{-\beta_1 |\tau|} + \alpha_2 e^{-\beta_2 |\tau|} \qquad (3.1)$$


would be appropriate to approximate the real autocorrelation function. The noisy appearance of the estimated autocorrelation function is again a consequence of the finite sample size available for its computation. The main problem in the evaluation of the $\alpha_i$ and $\beta_i$ is that they are, in principle, very sensitive to the sampling interval (in this case, 5 minutes). In Appendix III, the exact procedure followed to evaluate them is described in detail.

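With the autocorrelation function given in (3.1), the following expression is obtained for the variance $\sigma^2(t)$:

$$\sigma^2(t) = 2\alpha_1 \int_0^t (t-r)\, e^{-\beta_1 r}\,dr + 2\alpha_2 \int_0^t (t-r)\, e^{-\beta_2 r}\,dr \qquad (3.2)$$

$$\sigma^2(t) = \frac{2\alpha_1}{\beta_1}\left[ t - \frac{1-e^{-\beta_1 t}}{\beta_1} \right] + \frac{2\alpha_2}{\beta_2}\left[ t - \frac{1-e^{-\beta_2 t}}{\beta_2} \right] \qquad (3.3)$$

and substituting (3.3) in (2.23) we obtain,

$$P_{dk}(t<\tau) = 1 - \phi_{dk}(\tau)\, \exp\left\{ -(\alpha_{dk} - \sigma_{dk1} - \sigma_{dk2})\,\tau - \frac{\sigma_{dk1}}{\beta_1}\left[1-e^{-\beta_1 \tau}\right] - \frac{\sigma_{dk2}}{\beta_2}\left[1-e^{-\beta_2 \tau}\right] \right\} \qquad (3.4)$$

where the following constants have been defined

$$\alpha_{dk} = s_{dk}\,\bar{m}_{dk} + c_{dk} - \rho_{dk}(s_{dk}, b_{min}) \qquad (3.5)$$

$$\sigma_{dk1} = \frac{\alpha_1}{\beta_1}\, s_{dk}^2 \qquad (3.6)$$

$$\sigma_{dk2} = \frac{\alpha_2}{\beta_2}\, s_{dk}^2 \qquad (3.7)$$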


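The hazard function is given by

$$h_{dk}(\tau) = \alpha_{dk} - \sigma_{dk1}\left[1-e^{-\beta_1 \tau}\right] - \sigma_{dk2}\left[1-e^{-\beta_2 \tau}\right] - \frac{1}{\phi_{dk}(\tau)}\, \frac{\partial \phi_{dk}(\tau)}{\partial \tau} \qquad (3.8)$$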


The statistics of the time to failure for a doubly stochastic Poisson process when the intensity process is a cyclostationary process are then equivalent to the statistics of a non homogeneous Poisson process with hazard function given in (3.8). Although it looks formidable, this hazard function reduces to a constant term plus a periodic component plus an exponentially decreasing term. Note that, neglecting the periodic component, this hazard function decreases exponentially between the following extreme values




$$h_{dk}(0) = \alpha_{dk} \qquad (3.9)$$

$$h_{dk}(\infty) = \alpha_{dk} - \sigma_{dk1} - \sigma_{dk2} \qquad (3.10)$$

as shown in Figure 3-3.

Figure 3-3: Hazard function of the equivalent non homogeneous Poisson process characterizing the statistics of the time to failure of a file system.
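The behavior between these two extreme values can be checked numerically. The Python sketch below evaluates the hazard function with the periodic component neglected (that is, with the periodic factor set to one), together with the corresponding reliability function. The function names and the parameter values are illustrative assumptions, not the values estimated for the CMU-A.

```python
import math


def hazard(tau, alpha, sigmas, betas):
    """alpha - sum_i sigma_i * (1 - exp(-beta_i * tau)); periodic term neglected."""
    return alpha - sum(s * (1.0 - math.exp(-b * tau))
                       for s, b in zip(sigmas, betas))


def reliability(tau, alpha, sigmas, betas):
    """exp(-integral of the hazard from 0 to tau), periodic factor set to one."""
    decay = sum(s / b * (1.0 - math.exp(-b * tau))
                for s, b in zip(sigmas, betas))
    return math.exp(-(alpha - sum(sigmas)) * tau - decay)


# Illustrative (hypothetical) parameter values.
alpha = 2.0
sigmas = (0.5, 0.3)
betas = (0.6, 0.2)

h_start = hazard(0.0, alpha, sigmas, betas)    # equals alpha
h_limit = hazard(1.0e3, alpha, sigmas, betas)  # approaches alpha - sum(sigmas)
```

The hazard starts at `alpha` and decays toward `alpha - sum(sigmas)`, each exponential term contributing with its own time constant `1 / beta`.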


3.2  The  Probability  Distribution  Functions  of  the  Time  to 
System  Failure 


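The periodic component of the kernel utilization process, $m_{sy}(t)$, has been shown in Figure 2-5. Figure 3-4 shows the autocorrelation function of the process $z_{sy}(t)$, suggesting again an approximation of the form given in (3.1). The following expression is then obtained for the PDF of the time to system failure

$$P_{sy}(t<\tau) = 1 - \phi_{sy}(\tau)\, \exp\left\{ -(\alpha_{sy} - \sigma_{sy1} - \sigma_{sy2})\,\tau - \frac{\sigma_{sy1}}{\beta_1}\left[1-e^{-\beta_1 \tau}\right] - \frac{\sigma_{sy2}}{\beta_2}\left[1-e^{-\beta_2 \tau}\right] \right\} \qquad (3.11)$$

where

$$\alpha_{sy} = (s_{sw}+s_{hw})\,\bar{m}_{sy} + c_{hw} - \rho_{sy}(s_{sw}+s_{hw}, k_{min}) - \rho_{sy}(s_{sw}, k_0) \qquad (3.12)$$

$$\sigma_{sy1} = \frac{\alpha_1}{\beta_1}\,(s_{sw}+s_{hw})^2 \qquad (3.13)$$

$$\sigma_{sy2} = \frac{\alpha_2}{\beta_2}\,(s_{sw}+s_{hw})^2 \qquad (3.14)$$

The hazard function is given by

$$h_{sy}(\tau) = \alpha_{sy} - \sigma_{sy1}\left[1-e^{-\beta_1 \tau}\right] - \sigma_{sy2}\left[1-e^{-\beta_2 \tau}\right] - \frac{1}{\phi_{sy}(\tau)}\, \frac{\partial \phi_{sy}(\tau)}{\partial \tau} \qquad (3.15)$$

Figure 3-4: Estimated and approximated Autocorrelation functions of the kernel utilization process.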

3.3  Simplified  expressions  for  known  starting  time 

All the expressions given in Sections 3.1 and 3.2 have been obtained after computing the expectation over all possible values of the starting time in a one day period. If the system starting time is known, different expressions are obtained. The only differences between the PDF of the time to failure with known starting time and the PDF averaged over a one day period are that the function $\phi(\tau)$ becomes a constant equal to one and that the $\alpha$ term in the exponential is slightly different. In particular, for the case of the file system failures,

$$P_{dk}(t<\tau \mid t_s) = 1 - \exp\left\{ -\alpha'_{dk} - (\alpha''_{dk} - \sigma_{dk1} - \sigma_{dk2})\,\tau - \frac{\sigma_{dk1}}{\beta_1}\left[1-e^{-\beta_1 \tau}\right] - \frac{\sigma_{dk2}}{\beta_2}\left[1-e^{-\beta_2 \tau}\right] \right\} \qquad (3.16)$$

where $P_{dk}(t<\tau \mid t_s)$ is the probability that a failure will be detected before time $\tau + t_s$ given that no failure had been detected at time $t_s$, and

$$\alpha'_{dk} = s_{dk}\left[ M_{dk}(\tau + t_s) - M_{dk}(t_s) \right] \qquad (3.17)$$

$$\alpha''_{dk} = c_{dk} - \rho_{dk}(s_{dk}, b_{min}) \qquad (3.18)$$

$$M_{dk}(t) = \int_0^t m_{dk}(\tau)\,d\tau \qquad (3.19)$$

The hazard function for known starting time is

$$h_{dk}(\tau \mid t_s) = s_{dk}\, m_{dk}(t_s + \tau) + \alpha''_{dk} - \sigma_{dk1}\left[1-e^{-\beta_1 \tau}\right] - \sigma_{dk2}\left[1-e^{-\beta_2 \tau}\right] \qquad (3.20)$$


Similar  expressions  can  be  derived  for  the  distribution  of  the  time  to  system  failure. 


3.4  Distribution  functions  of  the  time  to  system  failure  due  to 
software  and  of  the  time  to  system  failure  due  to 
hardware  transients 


Once the values of $s_{sw}$, $s_{hw}$, $c_{hw}$, and $k_0$ are known, it is straightforward to derive an expression for the
PDF  of  the  time  to  system  failure  due  to  hardware  transients.  Repeating  the  derivation  described  in 
Section  2.4.  with  k0*0’  the  following  expression  is  obtained 


’Eahw'ffhwTffhw2^* ' 


1M1 

fil 


•ft  2 * 

[i-e  2] 


(3.21) 


where

$$\alpha_{hw} = s_{hw}\,\bar{m}_{sy} + c_{hw} - \rho_{sy}(s_{hw}, k_{min}) \qquad (3.22)$$

$$\sigma_{hw1} = \frac{\alpha_1}{\beta_1}\, s_{hw}^2 \qquad (3.23)$$

$$\sigma_{hw2} = \frac{\alpha_2}{\beta_2}\, s_{hw}^2 \qquad (3.24)$$


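In the general case, to obtain the PDF of the time to system failure due to software errors, an equation similar to (3.16) would be obtained, but with the following parameters

$$\alpha_{sw} = s_{sw}\,\bar{m}_{sy} - \rho_{sy}(s_{sw}, k_{min}) - \rho_{sy}(s_{sw}, k_0) \qquad (3.25)$$

$$\sigma_{sw1} = \frac{\alpha_1}{\beta_1}\, s_{sw}^2 \qquad (3.26)$$

$$\sigma_{sw2} = \frac{\alpha_2}{\beta_2}\, s_{sw}^2 \qquad (3.27)$$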


Note  that  if  the  system  failure  processes  due  to  software  errors  and  hardware  transients  are 
considered  to  be  nonhomogeneous  Poisson  processes,  each  with  a  PDF  of  the  form  (3.16),  the 
superposition  of  both  processes  (i.e.,  the  process  obtained  by  adding  the  hazard  functions  of  the 
software  failure  process  and  the  hardware  failure  process)  is  not  equal  to  the  total  system  failure 
process,  whose  PDF  is  given  by  (3.11).  This  is  because  they  are  not  statistically  independent.  Indeed, 
both  failure  processes  have  a  common  cause,  the  utilization  process  of  the  kernel  of  the  operating 
system. 

Table 3-1 gives the maximum likelihood values of $s_{sw}$, $s_{hw}$, $c_{hw}$, and $k_0$ for the CMU-A, along with the value of $\bar{m}_{sy}$. Note that since the value of $k_0$ is larger than $\bar{m}_{sy}$, expressions (3.22) through (3.25) may not be valid. The correction term $\rho(s_{sw}, k_0)$ has been computed assuming that $k_0 \ll \bar{m}_{sy}$, a condition that the maximum likelihood value of $k_0$ does not satisfy. In fact, if $k_0 > \bar{m}_{sy}$ the probability density function of the time to failure due to a software error degenerates into an exponential, such that

P^Kt)  »  1  -  e  Wo*  (3.28) 

the  PDF  of  the  time  to  system  failure  due  to  hardware  transients  being  given  in  (3.21). 

Figure 3-5 shows the relationship between the instantaneous value of the system failure rate and the software and hardware components. Note how the software failure rate is zero for a wide range of values of $k(t)$, but that its slope is larger than the slope of the failure rate due to hardware errors. Figure 3-5 thus suggests that assuming a linear relationship between the system failure rate due to software errors and kernel utilization may be an oversimplification. In fact, it seems reasonable to expect the probability of observing a software error to increase with the length of time that the software is exercised and with a "stress" factor depending on the apparent complexity of the input data to be processed at a given time. In this case, perhaps a higher degree polynomial would better describe the relationship between the software failure rate and software utilization.

Parameter        Value
$s_{sw}$         0.158
$s_{hw}$         0.086
$c_{hw}$         0.0079
$k_0$            0.225
$\bar{m}_{sy}$   0.19

Table 3-1: Maximum likelihood values of the coefficients defining the relationship between kernel utilization and system failure rate.


4. Discussion


Although the assumptions made in Section 2 are less restrictive than the usual assumption of modeling the failure process with a constant failure rate, the validity of the methodology presented here can be established only by comparison with the behavior of a real system and by contrast with the results that would be predicted by traditional models. This is the subject of Section 4.1, where the results given in Section 3 are compared with the values predicted by assuming an exponential distribution, a Weibull distribution, or a periodic distribution for the time to failure. In Section 4.2 an explanation is given for why the apparent failure rate is decreasing, and finally in Section 4.3 some preliminary conclusions are summarized.


4.1  Comparisons  with  other  models 

The most widespread model used to characterize the failure process of digital computers assumes the failure process to be a homogeneous Poisson process. The PDF of the time to failure is then given by

$$P_e(t<\tau) = 1 - e^{-\lambda_e \tau} \qquad (4.1)$$

where $\lambda_e$ is the (constant) failure rate. The maximum likelihood estimate of $\lambda_e$ is obtained simply by dividing the number of failures reported by the time that the system has been operational. All functions and parameters related to this model will be noted with subindex "e" and from now on this model will be referred to as the exponential model.

However, empirical studies [McConnel 79a], [Wagoner 73] have shown that a Weibull distribution gives much better goodness of fit to experimental data than a simple exponential. The Weibull PDF is given by

$$P_w(t<\tau) = 1 - e^{-(\lambda_w \tau)^{\alpha_w}} \qquad (4.2)$$

The Weibull distribution is characterized by two parameters: $\lambda_w$, the scale parameter, and $\alpha_w$, the shape parameter. For $\alpha_w = 1$, the Weibull distribution degenerates to the exponential. For $\alpha_w > 1$, the Weibull distribution has an increasing failure rate. A decreasing failure rate corresponds to $\alpha_w < 1$. All reports published to date claim that a decreasing failure rate Weibull distribution fits experimental data much better than a plain exponential model. Numerical procedures have been developed to find the maximum likelihood estimates of $\lambda_w$ and $\alpha_w$. These procedures are based on the works of [Thoman 69, Berger 74, Romano 77] and FORTRAN programs implementing them are given in [McConnel 79b].
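For readers who want to experiment, the following Python sketch estimates the two Weibull parameters of (4.2) from a complete (uncensored) sample of times to failure. It is not a transcription of the cited FORTRAN programs: it uses the standard likelihood equation for the shape parameter, solved here by bisection, and the function names, bracketing interval, and sample data are assumptions made for this example.

```python
import math


def weibull_mle(times, lo=1e-3, hi=50.0, iters=200):
    """Return (lambda_w, alpha_w) maximizing the Weibull likelihood."""
    n = len(times)
    t_max = max(times)
    logs = [math.log(t) for t in times]
    mean_log = sum(logs) / n

    def score(a):
        # Likelihood equation for the shape; times are scaled by t_max
        # so that large exponents cannot overflow.
        w = [(t / t_max) ** a for t in times]
        weighted = sum(wi * li for wi, li in zip(w, logs)) / sum(w)
        return weighted - 1.0 / a - mean_log

    # score() increases with a, so bisection brackets its single root.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    a = 0.5 * (lo + hi)
    s = sum((t / t_max) ** a for t in times)
    lam = (n / s) ** (1.0 / a) / t_max
    return lam, a


def weibull_log_likelihood(times, lam, a):
    """Log-likelihood of the density a*lam*(lam*t)**(a-1)*exp(-(lam*t)**a)."""
    return sum(math.log(a) + a * math.log(lam) + (a - 1.0) * math.log(t)
               - (lam * t) ** a for t in times)


# Hypothetical times to failure, in hours.
sample = [0.2, 0.5, 1.0, 1.2, 2.0, 3.1, 7.4]
lam_hat, alpha_hat = weibull_mle(sample)
```

An estimated shape below one corresponds to the decreasing failure rate case discussed above. The default bracket assumes the sample is not degenerate (not all times equal).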

A workload dependent model has been presented in [Butner 80]. A linear dependency between failure rate and workload is also assumed. The workload is characterized by a periodic function of time. The PDF becomes an exponential "modulated" by a periodic function

$$P_p(t<\tau) = 1 - e^{-\lambda_p \tau}\, e^{-F_p U(\tau)} \qquad (4.3)$$

where $F_p$ is defined as the load induced failure rate and $U(t)$ denotes the instantaneous load value. This model will be referred to as the periodic model, all its parameters having the subindex "p". Using the notation developed in Section 2, this is equivalent to assuming a utilization function $u(t) = m(t)$, where only the periodic component is taken into account, and where the Gaussian process $z(t)$ has been neglected. In this case,

$$p(t) = E\left\{ \lambda_p(t)\, e^{-\int_0^t \lambda_p(\tau)\,d\tau} \right\} \qquad (4.4)$$

where

$$\lambda_p(t) = s_p\, m(t) + c_p \qquad (4.5)$$

Note that (4.3) and (4.6) are equivalent. In Section II.5 the equations for computing the maximum likelihood values of $s_p$ and $c_p$ from a history of failures are given.

Finally, the model presented in this report will be referred to as the cyclostationary model. An expression for its PDF is rewritten here

$$P(t<\tau) = 1 - \exp\left\{ -(\alpha_c - \sigma_{c1} - \sigma_{c2})\,\tau - \frac{\sigma_{c1}}{\beta_1}\left[1-e^{-\beta_1 \tau}\right] - \frac{\sigma_{c2}}{\beta_2}\left[1-e^{-\beta_2 \tau}\right] + \ln \phi(\tau) \right\} \qquad (4.7)$$

Figure 4-1: Two alternatives to characterize system reliability. The maximum likelihood values of the hazard function parameters can be evaluated from the resource utilization functions or directly from a history of failures.


Table 4-1 summarizes the densities, Reliability and Hazard functions for each of the four models. Note that the cyclostationary model has both an asymptotically decreasing failure rate and a periodic component. Qualitatively, the cyclostationary model seems to integrate the approaches of the Weibull and periodic models. Note also that, for the case of file system failures, only two parameters need to be estimated from a history of system failures ($s_{dk}$ and $c_{dk}$), the other parameters in equation (4.7) being measured from the actual system behavior (the periodic component and the autocorrelation function of the resource utilization process). Since the cyclostationary model suggests a PDF of the form shown in (4.7), it is conceivable to postulate (4.7) as the real PDF of the failure process and estimate the values of $\alpha_c$, $\sigma_{c1}$, $\sigma_{c2}$, $\beta_1$, $\beta_2$ directly from a history of failures, therefore avoiding the measurement of the resource utilization functions. Figure 4-1 describes these two alternatives available to characterize system reliability. In Section II.3 the equations used to estimate the values of these parameters directly from a history of system failures are given.

The fifth distribution in Table 4-1 is a simplified version of the distribution obtained with the cyclostationary model, considering only one exponential in the hazard function, and neglecting the periodic component $\phi(t)$. Section II.4 gives the equations for estimating the maximum likelihood parameters of this distribution from a history of system failures.

Next, quantitative comparisons using data of a real system are in order. Table 4-2 shows the results of applying a $\chi^2$ goodness-of-fit test between the actual failures observed in the file system described in Section 3.1 and the distributions predicted by the above four models. A $\chi^2$ value smaller than $\chi^2_{0.05}$ (i.e., a level of confidence greater than 0.05) indicates a good fit between predicted and observed behavior and suggests the acceptance of the hypothetical distribution as the real distribution underlying the failure process.
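The test statistic itself is simple to compute once the observed times to failure have been grouped into intervals. The Python sketch below is a generic illustration, not the C2TST package: the function names, the interval edges, and the sample times are assumptions for this example. The statistic is compared against the tabulated $\chi^2_{0.05}$ value for the appropriate number of degrees of freedom (for a standard test, the number of intervals minus one minus the number of parameters estimated from the data).

```python
import math


def chi_square_statistic(times, cdf, edges):
    """Sum over intervals of (observed - expected)**2 / expected."""
    n = len(times)
    stat = 0.0
    for a, b in zip(edges[:-1], edges[1:]):
        observed = sum(1 for t in times if a <= t < b)
        expected = n * (cdf(b) - cdf(a))
        stat += (observed - expected) ** 2 / expected
    return stat


def exponential_cdf(rate):
    """Return the distribution function 1 - exp(-rate * t) of model (4.1)."""
    return lambda t: 1.0 - math.exp(-rate * t)


# Hypothetical times to failure and interval edges (same time unit).
observed_times = [0.1, 0.3, 0.4, 0.9, 1.1, 1.6, 2.2, 3.5, 4.8, 7.0]
interval_edges = [0.0, 0.5, 1.0, 2.0, 4.0, math.inf]
rate_hat = len(observed_times) / sum(observed_times)

statistic = chi_square_statistic(
    observed_times, exponential_cdf(rate_hat), interval_edges)
```

In practice the intervals are usually chosen so that each expected count is not too small. The ten-sample example above is only meant to show the mechanics.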

As can be seen from Table 4-2, only the cyclostationary model (both with direct and indirect evaluation of its maximum likelihood parameters) shows a good fit with experimental data. Neither the exponential nor the periodic failure rate models seem to be able to describe the failure process with significant accuracy. The simplified cyclostationary model distribution and the Weibull distribution are almost on the borderline of acceptance. Further insight can be gained by direct comparison of the hazard functions of the above four models. Figure 4-2 shows the hazard function of the above four models for the case of file system failures.

Table 4-3 shows the results of applying a $\chi^2$ goodness-of-fit test to the four models in the case of system failures. Note that for the periodic model, the maximum likelihood values of the coefficients are such that the proportionality term vanishes, and the constant term equals the $\lambda_e$ value of the exponential model. Although all models give a level of significance larger than 0.05, the cyclostationary model is again clearly superior, giving levels of significance of 0.9. The hazard functions of the four models for the case of system failures are given in Figure 4-3.

4.2  The  decreasing  hazard  function  paradox 

The hazard function found in the cyclostationary model presents the following paradox: neglecting the periodic component, expression (3.15) means that no matter at what time we start observing a system, the statistics of the time to failure are equivalent to the statistics of a non homogeneous Poisson process with decreasing hazard function. Since the hazard function is roughly the rate at which failures will be detected, this means that no matter at which time we start observing a system the apparent rate at which failures are detected will be a decreasing function of time. In this section, the reason for such surprising behavior is investigated and explained.

To understand the decreasing hazard function paradox, start with the simplest possible case. Assume that the real failure rate is given by a constant plus white noise.

$$\lambda_1(t) = m + x_1(t) \qquad (4.18)$$

where $m$ is the (constant) mean failure rate and $x_1(t)$ is a stationary, zero mean Gaussian process with autocorrelation function

R  x  <T>  -  <4-19> 

Vi  2 

and 


fi(T) 


Hm 

hlo 


1/h  O^r^h 
0  r>h 


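Assume that $m \gg W_1$, such that the probability of $\lambda_1(t)$ being negative can be neglected. The probability density function of the time to failure is then given by

Exponential

$$R_e(\tau) = e^{-\lambda_e \tau} \qquad (4.8)$$

$$h_e(\tau) = \lambda_e \qquad (4.9)$$

Weibull

$$R_w(\tau) = e^{-(\lambda_w \tau)^{\alpha_w}} \qquad (4.10)$$

$$h_w(\tau) = \frac{\alpha_w \lambda_w}{(\lambda_w \tau)^{1-\alpha_w}} \qquad (4.11)$$

Periodic

$$R_p(\tau) = e^{-\lambda_p \tau}\, e^{-F_p U(\tau)} \qquad (4.12)$$

$$h_p(\tau) = \lambda_p + F_p\, \frac{\partial U(\tau)}{\partial \tau} \qquad (4.13)$$

Cyclostationary

$$R_c(\tau) = \phi(\tau)\, \exp\left\{ -(\alpha_c - \sigma_{c1} - \sigma_{c2})\,\tau - \frac{\sigma_{c1}}{\beta_1}\left[1-e^{-\beta_1 \tau}\right] - \frac{\sigma_{c2}}{\beta_2}\left[1-e^{-\beta_2 \tau}\right] \right\} \qquad (4.14)$$

$$h_c(\tau) = \alpha_c - \sigma_{c1}\left[1-e^{-\beta_1 \tau}\right] - \sigma_{c2}\left[1-e^{-\beta_2 \tau}\right] - \frac{1}{\phi(\tau)}\, \frac{\partial \phi(\tau)}{\partial \tau} \qquad (4.15)$$

Simplified Cyclostationary

$$R_m(\tau) = \exp\left\{ -(\alpha_c - \sigma_{c1})\,\tau - \frac{\sigma_{c1}}{\beta_1}\left[1-e^{-\beta_1 \tau}\right] \right\} \qquad (4.16)$$

$$h_m(\tau) = \alpha_c - \sigma_{c1}\left[1-e^{-\beta_1 \tau}\right] \qquad (4.17)$$

Table 4-1: Reliability and Hazard functions of the five compared models.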
Model                   Parameter values            Degrees of   χ² value    χ²(0.05)   Level of
                                                    freedom                             confidence

Exponential             λe = 0.67                   7            130         14.067     0

Weibull                 λw = 0.91                   8            17.717      15.507     0.026
                        αw = 0.68

Periodic                sp = 1.25                   12           1007.194    21.026     0
                        cp = 0.28

Cyclostat.              -                           8            8.69        15.507     0.36

Cyclostat. (Direct)     αc = 2.13                   6            8.642       12.592     0.19
                        σc1 = 1.42
                        [illegible] = 4.03
                        [illegible] = 0.59
                        β2 = 0.21

Simplified Cyclostat.   αc = 1.69                   8            19.434      15.507     0.013
                        [illegible] = 1.38

Table 4-2: Results of a χ² goodness-of-fit test with the Exponential, Weibull, Periodic, and Cyclostationary models for file system failures. Only the Cyclostationary model gives a level of confidence greater than 0.05. The Weibull and simplified cyclostationary models give smaller levels of confidence but close to 0.05. The hypothesis that the time to failure can be characterized with Exponential or Periodic models has to be rejected. The data used was obtained from five weekdays of system operation during which 877 (transient) failures were detected. The MTTF value is 7 minutes. The file system is composed of 8 RP06 disk drives totalling 1600 megabytes of online storage.

Figure 4-2: Hazard functions predicted by Exponential, Weibull, Periodic, and Cyclostationary models for file system failures.
Model                   Parameter values            Degrees of   χ² value    χ²(0.05)   Level of
                                                    freedom                             confidence

Exponential             λe = 0.0073                 8            7.87        15.507     0.45

Weibull                 λw = 0.0074                 7            7.95        14.067     0.35
                        αw = 0.98

Periodic                sp = 0.0                    -            -           -          -
                        cp = 0.0073

Cyclostat.              ssw = 0.158                 5            1.61        11.070     0.9
                        shw = 0.0869
                        chw = 0.0079
                        k0 = 0.0357

Cyclostat. (Direct)     αc = 0.013                  5            1.66        11.070     0.9
                        σc1 = 0.0054
                        σc2 = 0.0080
                        β1 = 0.21
                        β2 = 0.0041

Simplified Cyclostat.   αc = 0.014                  6            0.75        12.592     0.9
                        σc1 = 0.0064
                        β1 = 0.21

Table 4-3: Results of a χ² goodness-of-fit test with the Exponential, Weibull, Periodic, and Cyclostationary models for system failures (crashes). Although all models give a level of confidence larger than 0.05, the Cyclostationary model shows a better fit to real data. Note that for the Periodic model the maximum likelihood values of the coefficients are such that the proportionality coefficient vanishes and the constant term is equal to the λ value of the Exponential model. The data used was obtained from 29 weekdays of system operation during which 60 failures were detected, giving a MTTS (Mean Time To restart) value of 11 hours.

Figure 4-3: Hazard functions predicted by Exponential, Weibull, Periodic, and Cyclostationary models for system failures.
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gilt) 


a  /  -ml  x 

pi(t)  » -%{•  e  2  ) 


where  <r2(t)  is  the  variance  of  the  integrated  process 


,/V, 


az(t)  *  W.  /  (t-r)  5(t)  dr 


* 


such  that 


a  /  -mt  w,t  > 
Pi(t)  =  -£(  e  e  1  ) 


P(Kr)  =  1  -  e  1 


h^T)  =  m-W, 


(4.20) 


(4.21) 


(4.22) 


(4.23) 


(4.24) 


(4.25) 


This failure process is equivalent to a homogeneous Poisson process with an apparent hazard function not equal to the mean failure rate, but equal to the mean failure rate minus the "power" of the noise, $W_1$. The reason is that, since the failure rate appears in the exponent of an exponential, variations above the mean failure rate are not weighted equally with variations below the mean. In fact, the variations below the mean value are more heavily weighted, hence the smaller limiting failure rate that results when the expectation is taken over all possible realizations of the failure rate process.

Assume now that the real failure rate is equal to a constant plus a zero mean, stationary Gaussian process

$$\lambda_2(t) = m + x_2(t) \qquad (4.26)$$


but  that  now  the  autocorrelation  function  of  the  Gaussian  process  is  given  by 
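
$$ R_{x_2x_2}(\tau) = \sigma_{x_2}^2\, e^{-\beta|\tau|} \tag{4.27} $$

In this case,

$$ \sigma^2(t) = 2\int_0^t (t-\tau)\,R_{x_2x_2}(\tau)\,d\tau
             = \frac{2\sigma_{x_2}^2}{\beta}\left[\, t - \frac{1 - e^{-\beta t}}{\beta} \right] \tag{4.28} $$

Defining

$$ W_2 = \frac{\sigma_{x_2}^2}{\beta} \tag{4.29} $$

the following expressions are obtained

$$ p_2(t) = -\frac{d}{dt}\left( e^{-mt}\, e^{\,W_2 t - \frac{W_2}{\beta}\left[1 - e^{-\beta t}\right]} \right) \tag{4.30} $$

$$ P_2(t \le \tau) = 1 - e^{-(m - W_2)\tau - \frac{W_2}{\beta}\left[1 - e^{-\beta\tau}\right]} \tag{4.31} $$

$$ h_2(\tau) = m - W_2\left[1 - e^{-\beta\tau}\right] \tag{4.32} $$
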

This failure process is then equivalent to a nonhomogeneous Poisson process with an
exponentially decreasing hazard function. For $\tau = 0$, the apparent hazard function is equal to
the mean real failure rate, $m$. For $\tau \to \infty$, the apparent hazard function equals the
value that was obtained assuming the Gaussian process to be white noise. As $\tau$ increases, the
failure rate approaches this limiting value exponentially.

Finally, note that if $W_1 = W_2$, the system with a white noise utilization process will be more
reliable than another system having a utilization process with the autocorrelation function given
by (4.27). In the case of white noise, the system reaches the minimum value of its hazard function
at t = 0, while for an autocorrelation function of the form (4.27) the minimum value is approached
exponentially. Hence a non-obvious way of increasing the reliability of a particular resource would
be to build a system such that the utilization process of that resource approaches white noise as
closely as possible.
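
The comparison between the two noise models can be checked numerically. The sketch below evaluates
the hazard functions and reliabilities implied by (4.25) and (4.32); the function names and the
parameter values in the usage note are illustrative assumptions, not taken from the report.

```python
import math


def hazard_white(tau, m, w1):
    """Apparent hazard for white-noise fluctuations: the constant m - W1."""
    return m - w1


def hazard_correlated(tau, m, w2, beta):
    """Apparent hazard for exponentially correlated fluctuations."""
    return m - w2 * (1.0 - math.exp(-beta * tau))


def reliability_white(tau, m, w1):
    """Probability of surviving past tau under the white-noise model."""
    return math.exp(-(m - w1) * tau)


def reliability_correlated(tau, m, w2, beta):
    """Probability of surviving past tau: exp of minus the integrated hazard."""
    integrated = (m - w2) * tau + (w2 / beta) * (1.0 - math.exp(-beta * tau))
    return math.exp(-integrated)
```

With the hypothetical values m = 0.1, W1 = W2 = 0.02 and beta = 0.5, `reliability_white` exceeds
`reliability_correlated` for every tau > 0, because the correlated model starts at the higher
hazard m and only decays toward m - W2.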

4.3 Preliminary conclusions

It has been shown how the cyclostationary model is capable of predicting the reliability of a
Time-sharing system in steady state operation. Workload dependency is stated explicitly in the
model, by means of resource utilization functions. System reliability is evaluated in terms of the
utilization of system singularities (i.e., the Kernel of the operating system). And, in general,
resource reliability is evaluated as a function of the utilization of each resource.

Software and hardware reliability can be evaluated separately merely from a history of system
failures and some knowledge of how the system behaves (periodic mean and autocorrelation
function). A linear relationship between software failure rate and software utilization has given
somewhat contradictory results, suggesting that perhaps more complex relationships need to be
considered. In any case, it has been shown how, by establishing the relationships between software
failure rate, hardware failure rate, and kernel utilization, it is in principle possible to
evaluate the contribution of software and hardware to the unreliability of the total system.

Perhaps  one  of  the  more  important  results  is  that  the  probability  density  function  for  the  simplified 
cyclostationary  model  (having  a  single  exponential  in  its  hazard  function)  has  a  known  Laplace 
transform,  making  it  suitable  for  Markov  modelling.  Neither  the  complete  Cyclostationary,  nor  the 
Weibull,  nor  the  Periodic  models  lead  to  probability  density  functions  with  known  Laplace  transforms. 

From a user viewpoint, there is a reinforcement effect between workload and lack of reliability.
Higher workload implies that the Kernel of the operating system has to make more decisions per unit
time, increasing the probability of a system failure. Hence, not only does the user receive fewer
CPU cycles per unit time, but the probability that these cycles will become useless because the job
will have to be restarted also increases.

Hence, high reliability seems to be in contradiction with other performance measures (such as the
maximum number of jobs allowed to be simultaneously active).

But the contradiction between reliability and other performance measures seems to be of a deeper
nature. In [Spirn 77] several paging algorithms are described and modelled. Page faulting in a
virtual memory system can be described as a stochastic process, and a usual optimality criterion
for paging algorithms is how well they are able to predict future system behavior given past and
present system behavior. This is exactly the information given by the autocorrelation function, and
in [Spirn 77] several paging algorithms are compared in terms of how well their predictions fit the
autocorrelation function of a real page faulting process. But in Section 4.2 it has been shown that
a way of improving reliability is to have white noise as the resource utilization process. For
white noise, the future values of the process are completely unpredictable, no matter how long the
system has been observed and no matter how near in the future the desired prediction is. Hence, for
an optimally reliable paging system, the utilization process should be white noise. Since future
system behavior would be unpredictable, an optimum paging algorithm under these conditions would
just swap pages out of memory at random, therefore lowering system performance.

5. Proposed research

5.1 On the linear dependency between overhead and failure rate

The  first  topic  to  be  investigated  in  depth  would  be  the  real  dependency  between  overhead  and 
failure  rate.  A  linear  relationship  has  been  assumed  for  the  cyclostationary  model,  and  any  other 
relationship  may  lead  to  hopelessly  complex  mathematical  problems  in  the  evaluation  of  the 
expectation  of  the  modified  Gaussian  process.  However,  there  is  always  a  possibility  that  other 
dependencies  may  be  more  accurate,  and  even  if  exact  expressions  for  the  distribution  of  the  time  to 
failure  cannot  be  obtained  for  dependencies  other  than  linear,  errors  due  to  a  linear  relationship 
assumption  should  be  understood. 

Since a failure cannot be detected if the system is not used, the only a priori assumption that
seems reasonable is that the failure rate must be a non-decreasing function of the system overhead.
What needs to be established is the range over which a linear dependency is accurate, the
confidence intervals that can be obtained, and whether the failure rate can be characterized by
relationships other than linear if necessary.


5.2 Generalization to systems showing a non-cyclostationary behavior

One of the fundamental assumptions made to develop the cyclostationary model has been that the
system overhead could be approximated by adding a periodic function to a modified Gaussian
process. This may be a good approximation for time-sharing systems, but it certainly does not apply
to many real-time and command and control systems. In fact, the highest demand for high
availability systems comes from special purpose command and control systems like the ones to be
installed in aircraft, missiles, satellites, and so on. For some of these systems, the workload can
be modelled by a sequence of load states. If the exact sequence is not known in advance but the
possible alternatives are known, the instantaneous value of the mean workload could be modelled by
a semi-Markov process [Howard 71].

The cyclostationary model would then evolve into a model in which the instantaneous mean failure
rate is not a periodic function of time, but a random variable whose statistics depend on the
mission to be accomplished. This new model would have, in addition, the ability to incorporate the
effect of permanent hardware failures, transient hardware failures, and software failures. In fact,
most performance-reliability models presented to date just assume that in the presence of a
permanent hardware failure the system reconfigures itself and continues to operate with a possibly
different computational capacity. Whether the capacity diminishes or the workload increases, the
result will be a system operating in one of a set of possible states for a given period of time.


5.3  Characterizing  total  system  performance 

Present reliability evaluation measures, such as Reliability or Availability, are felt to be
inadequate due to the large gap that separates, say, the Availability of a computing facility and
the cost that has to be paid due to lack of reliability. Digital computers are used to store and
process information, and the sooner the desired information is available, the better the system.
The occurrence of a system failure means waiting until operation is restored, bringing the machine
to a consistent state, possibly restarting computations that were interrupted because of the
failure and (if possible) updating the system with the information it was supposed to process while
it was not operational. In short, it means a delay in obtaining the desired information and an
added cost associated with the extra computations needed to restore the system to the desired state
after the failure occurred.

From a single user viewpoint, a failure also means a delay. In [Castillo 80] the expected elapsed
time required to complete a program was computed under rather restrictive assumptions, but
separating the "useful" time that leads to program completion from the "useless" time due to lack
of reliability. Hence, a possible extension of the methodology presented in this report would be to
try to characterize the cost associated with lack of reliability from the resource utilization
functions of each system resource.


5.4 System design optimization criteria derived from this model

Finally, it has been described in Section 4 how reliability seems to be in contradiction with other
performance indicators. If it is assumed that the performance of a digital computer can be
characterized by a vector, each component measuring a different aspect of performance (for
example, throughput, execution rate, reliability, storage capacity, etc.), the arguments presented
in Section 4 seem to indicate that it is not possible to raise the value of all these components at
the same time (except by enforcement of fault-intolerance in each resource). Hence, it seems in
principle possible to look for the optimum performance point, defined as the operating state at
which a cost function associated with each performance measure is minimum.

I. The cyclostationary Poisson process


The problem is to obtain an expression for the pdf of the time to failure of a doubly stochastic
failure process

$$ p(t) = E\left\{ \lambda(t)\, e^{-\int_0^t \lambda(\tau)\,d\tau} \right\} \tag{I.1} $$


where the time origin is assumed to be t = 0 and

$$ \lambda(t) = a\,u(t) + b = a\,m(t) + b + a\,z(t) - a\,\bar{z}(t) \tag{I.2} $$

m(t) being a periodic function of time, z(t) a stationary Gaussian process independent of m(t), and

$$ \bar{z}(t) = \begin{cases} z(t) & \ldots \\ 0 & \text{otherwise} \end{cases} \tag{I.3} $$


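Define

$$ \Lambda(t) = \int_0^t \lambda(\tau)\,d\tau \tag{I.4} $$

It is shown in [Saleh 74] that since

$$ \lambda(t) = \frac{d\Lambda(t)}{dt} $$

(I.1) can be rewritten as

$$ p(t) = -\frac{d}{dt}\,E\left\{ e^{-\Lambda(t)} \right\} \tag{I.5} $$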


I.1 The deterministic part

Let us now examine the first expectation. m(t) is a periodic function of time. However, the time
origin is, in principle, unknown. The probability density function (pdf) given here is going to be
compared with the estimated pdf of a real system. In an observed pdf, each system failure will be
associated with a time origin corresponding to the moment at which the system was started. However,
if the failure rate is a time-dependent function, it cannot be assumed that the system will be
started with equal probability at any time during a one-day period. In fact, the system will be
started more often in the periods of time in which it fails more. Since m(t) is precisely the
periodic component of the failure rate,


(I.13)


(I.14)


and T is the period of m(t).

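To evaluate the expectation in (I.13), m(t) will be approximated by a finite Fourier series
expansion

$$ m(t) = \bar{m} + \sum_{n=1}^{N} c_n \sin(n\omega t + \varphi_n) \tag{I.15} $$

where the following constants have been used

$$ \omega = \frac{2\pi}{T} \tag{I.16} $$

$$ \bar{m} = \frac{1}{T}\int_0^T m(t)\,dt \tag{I.17} $$

$$ c_n = \left(a_n^2 + b_n^2\right)^{1/2} \tag{I.18} $$

$$ \varphi_n = \arctan\frac{a_n}{b_n} \tag{I.19} $$

$$ a_n = \frac{2}{T}\int_0^T m(t)\cos(n\omega t)\,dt \tag{I.20} $$

$$ b_n = \frac{2}{T}\int_0^T m(t)\sin(n\omega t)\,dt \tag{I.21} $$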
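
The Fourier constants defined above can be computed numerically from samples of m(t) over one
period. The following is a minimal sketch, assuming m is available as a Python callable and that a
rectangle-rule quadrature is accurate enough; `fourier_fit`, `evaluate_fit` and `n_samples` are
hypothetical names introduced here.

```python
import math


def fourier_fit(m, period, n_harmonics, n_samples=4096):
    """Return (mean, [(c_n, phi_n), ...]) for a function m with the given period."""
    omega = 2.0 * math.pi / period
    dt = period / n_samples
    times = [k * dt for k in range(n_samples)]
    values = [m(t) for t in times]
    mean = sum(values) * dt / period
    harmonics = []
    for n in range(1, n_harmonics + 1):
        a_n = 2.0 / period * dt * sum(
            v * math.cos(n * omega * t) for v, t in zip(values, times))
        b_n = 2.0 / period * dt * sum(
            v * math.sin(n * omega * t) for v, t in zip(values, times))
        c_n = math.hypot(a_n, b_n)
        # atan2 agrees with arctan(a_n / b_n) when b_n > 0 and picks the
        # correct quadrant otherwise.
        phi_n = math.atan2(a_n, b_n)
        harmonics.append((c_n, phi_n))
    return mean, harmonics


def evaluate_fit(mean, harmonics, period, t):
    """Evaluate mean + sum_n c_n * sin(n * omega * t + phi_n)."""
    omega = 2.0 * math.pi / period
    return mean + sum(
        c * math.sin((n + 1) * omega * t + phi)
        for n, (c, phi) in enumerate(harmonics))
```

For a daily workload profile one would call `fourier_fit(m, 24.0, N)` with time in hours; the
number of harmonics N is a modelling choice that the sketch leaves to the caller.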

Taking into account this approximation, we can obtain another representation for M(t). First, note
that

$$ a\int_{\tau}^{\tau+t} m(u)\,du = a\bar{m}\,t + a\sum_{n=1}^{N}\frac{c_n}{n\omega}
   \left[\cos(n\omega\tau + \varphi_n) - \cos(n\omega(\tau+t) + \varphi_n)\right] \tag{I.22} $$

$$ = a\bar{m}\,t + g_a(t,\tau) \tag{I.23} $$

where $g_a(t,\tau)$ is defined as

$$ g_a(t,\tau) = a\sum_{n=1}^{N}\frac{c_n}{n\omega}
   \left[\cos(n\omega\tau + \varphi_n) - \cos(n\omega(\tau+t) + \varphi_n)\right] \tag{I.24} $$

$E\{e^{-aM(t)-bt}\}$ can now be written as

$$ E\left\{e^{-aM(t)-bt}\right\} = e^{-(a\bar{m}+b)t}\int_0^T m_0(\tau)\,e^{-g_a(t,\tau)}\,d\tau \tag{I.25} $$

$$ = \phi_a(t)\,e^{-(a\bar{m}+b)t} \tag{I.26} $$

where

$$ \phi_a(t) = \int_0^T m_0(\tau)\,e^{-g_a(t,\tau)}\,d\tau \tag{I.27} $$

is also a periodic function of t.


I.2 The stochastic part


The problem is now to compute

$$ E\left\{ e^{-a[Z(t)-\bar{Z}(t)]} \right\} \tag{I.28} $$

Since z(t) is a zero-mean Gaussian process, aZ(t) will be another zero-mean Gaussian process with
variance (see [Papoulis 65], pp. 323-325)

$$ \sigma^2(t) = 2a^2\int_0^t\!\int_0^{t_1} R_{zz}(t_1,t_2)\,dt_2\,dt_1 \tag{I.29} $$

where $R_{zz}(t_1,t_2)$ is the autocorrelation function of the process z(t). If in addition z(t) is
stationary,

$$ \sigma^2(t) = 2a^2\int_0^t (t-\tau)\,R_{zz}(\tau)\,d\tau \tag{I.30} $$


The main problem in the evaluation of (I.28) is the evaluation of the statistics of the modified
process $\bar{z}(t)$ after integration, that is, the statistics of the excess area of a Gaussian
process above a given level c in [0,t] (see Figures I-1 and I-2). The problem of level crossing for
Gaussian processes has been extensively treated in the literature. In particular, in
[Stratonovich 67] this problem is studied in detail, and expressions are given for the duration of
peaks above a given level, and the excess area under such peaks. This is exactly what is required.
The following is a summary of Chapter 1-3, Vol II, of [Stratonovich 67]. The derivation will only
be outlined here with remarks on the assumptions used and the results obtained.


Figure I-1: A possible realization of $z(\tau) - \bar{z}(\tau)$.

Figure I-2: A realization of $\bar{z}(\tau)$. The shaded area corresponds to the integral of
$\bar{z}(\tau)$ from 0 to t, $\bar{Z}(t)$.

$\bar{Z}(t)$ is a random variable whose exact statistics may be impossible to compute. Its value
for a given realization of the process z(t) is equal to the sum of the excess areas of all peaks of
z(t) above a level c. It is shown in [Stratonovich 67], p. 59, that if the duration of the peaks is
much smaller than the time between peaks, the time between upcrossings (downcrossings) can be
approximated by an exponentially distributed random variable. The probability of having k peaks in
[0,t] is then given by

$$ P(n = k) = \frac{(\eta t)^k}{k!}\,e^{-\eta t} \tag{I.31} $$

where $\eta$ is the mean number of peaks per unit time. In the case of a stationary Gaussian
process, $\eta$ is given by ([Stratonovich 67], p. 7)


V 


(132) 


where 


d\jr) 


dr 


T«0 


(133) 

These expressions are valid only for "smooth" processes. Qualitatively, the smoothness condition
means that z(t) must be differentiable and close to a straight line segment during a sufficiently
small time interval. In the case of Gaussian processes, this condition means that $R_{zz}(\tau)$
must have a finite second derivative at the origin. Assuming that this condition is satisfied, it
is now required to characterize the excess area under each peak.


If  the  duration  of  peaks  is  small  and  the  process  is  smooth,  the  second  derivative  of  z(t)  can  be 
considered  constant  over  the  duration  of  a  peak,  the  peak  can  be  assumed  to  be  of  parabolic  shape, 
and  the  excess  area  depends  only  on  the  value  of  the  first  derivative  of  z(t)  at  the  time  of  crossing  the 
level  c.  Under  these  assumptions,  the  following  expression  can  be  obtained  for  the  probability  density 
function  of  the  area  under  a  peak  ( [Stratonovich  67],  Vol  II,  pp.  68-72) 


P(S)  =f  (R2)1/23 


2/3 


V 


s1/3e 


.-LrJfii 

2  2ff3 


(R^a]273 


(1.34) 


The value of $\bar{Z}(t)$ is given by

$$ \bar{Z}(t) = s_1 + \cdots + s_k \tag{I.35} $$

where each of the $s_j$ has the density given in (I.34) and k is another random variable with the
distribution given in (I.31). An exact evaluation of (I.28) would require knowledge of the joint
probability density function of $Z(t)$ and $\bar{Z}(t)$. Since this is impossible to obtain, the
approximation will be made that

$$ \bar{Z}(t) = k\,E[s] \tag{I.36} $$

where k is the number of peaks in [0,t] and E[s] is the mean peak area. In this case, $Z(t)$ and
$\bar{Z}(t)$ are independent random variables and

$$ E\left\{e^{-a[Z(t)-\bar{Z}(t)]}\right\} = E\left\{e^{-aZ(t)}\right\}\,E\left\{e^{a\bar{Z}(t)}\right\} \tag{I.37} $$


Since aZ(t) is a zero-mean Gaussian variable with variance $\sigma^2(t)$ given in (I.30),

$$ E\left\{e^{-aZ(t)}\right\} = \frac{1}{\sqrt{2\pi}\,\sigma(t)}\int_{-\infty}^{\infty}
   e^{-x}\,e^{-x^2/2\sigma^2(t)}\,dx \tag{I.38} $$

$$ = e^{\sigma^2(t)/2} \tag{I.39} $$

and

$$ E\left\{e^{a\bar{Z}(t)}\right\} = \sum_{k=0}^{\infty} P(n = k)\,e^{akE[s]} \tag{I.40} $$

$$ = e^{\eta t\,(e^{aE[s]} - 1)} \tag{I.41} $$

$$ = e^{\rho(a,c)\,t} \tag{I.42} $$

where $\rho(a,c)$ is the correction factor

$$ \rho(a,c) = \eta\,(e^{aE[s]} - 1) \tag{I.43} $$
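
The closed form of the Poisson sum can be verified numerically. This sketch assumes illustrative
values for a, the peak rate, the mean peak area and t (none taken from the report) and compares a
truncated version of the sum over k with the exponential form; `k_max` is a hypothetical truncation
parameter.

```python
import math


def correction_factor(a, eta, mean_area):
    """rho(a, c) = eta * (exp(a * E[s]) - 1); eta and E[s] both depend on c."""
    return eta * (math.exp(a * mean_area) - 1.0)


def expectation_by_sum(a, eta, mean_area, t, k_max=200):
    """Truncated sum over k of P(n = k) * exp(a * k * E[s]) with Poisson P."""
    total = 0.0
    prob = math.exp(-eta * t)  # P(n = 0)
    for k in range(k_max + 1):
        total += prob * math.exp(a * k * mean_area)
        # Recurrence P(n = k + 1) = P(n = k) * eta * t / (k + 1).
        prob *= eta * t / (k + 1)
    return total
```

Calling `expectation_by_sum(0.5, 2.0, 0.3, 4.0)` and `math.exp(correction_factor(0.5, 2.0, 0.3) * 4.0)`
should agree to within rounding, since the truncated tail is negligible for these values.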

Note that both $\eta$ and $E[s]$ depend on the value of the level c. The value of $E[s]$ can be
computed from (I.34).

E[s]  *  .  (1.44) 

„  «7rlV!.  ■  (1.45) 

fa'2 

If $aE[s] \ll 1$ (that is, if the value of c is much larger than the variance $\sigma^2$), the
approximation can be made that


provided  that  r^<«1.  In  this  case,  the  variance  if  the  integrated  process  becomes,  according  to 
(I.30)

and the expectation (I.11) becomes


,  2  2  fit 

(am  +  bp(a,c)-a2-2-]t--2-[l -e  ] 
0a(t)  e  P  0  2 


(1.51) 


(1.52) 


and $\rho(a,c)$ is given in equation (I.48). The fact that the term $(R_2)^{1/2}$ cancels in the approximated value
of  the  correction  factor  p(a,c)  has  an  interesting  physical  meaning.  R’^ft)  is  the  autocorrelation 
function  of  the  Gaussian  process  that  would  be  obtained  at  the  output  of  a  low-pass  filter  with 
bandwidth $1/\tau_c$ when its input is the process z(t). The fact that the value of $\rho(a,c)$ is
independent of $R_2$ means that the area generated under the peaks of z(t) per unit time is
independent of the bandwidth of this filter. The area generated per peak diminishes for higher
bandwidth, while the number of peaks per unit time increases with the bandwidth. Fortunately, these
two effects cancel each other, so that the area generated under the peaks per unit time remains
constant, independent of the process bandwidth.


subject to the constraints

1) $\alpha > 0$

2) $\beta > 0$

3) $\gamma > 0$

4) $\alpha - \gamma > 0$    (II.6)


Although this is a typical nonlinear programming problem for which general methods are available
(see, for example, [Bazaraa 79]), the problem of minimizing functions like (II.5) presents two
peculiarities. First, every evaluation of the function requires computing the sum of n terms, n
being the number of observed failures. In the case of the file system described in Section 3.1, the
number of observed failures in five days of system operation is 877. Hence, function evaluation (or
gradient evaluation) is computationally expensive and an efficient method is needed.

Second, although several efficient methods are known for minimization subject to nonlinear
inequality constraints, these methods usually assume that the constraints are external to the
mathematical statement of the problem, and that the objective function can in fact be evaluated
outside the constraints. This is not the present case. The fourth constraint in (II.6) must be
satisfied simply because the objective function (II.5) does not exist unless its parameters satisfy
this constraint (in the sense that the logarithm of a negative number does not exist). Indeed, the
fourth constraint says that the hazard function must be positive, and a solution that does not
satisfy the fourth constraint in (II.6) invalidates the existence of the objective function itself.
Hence, minimization algorithms that require the evaluation of the objective function outside the
constraints cannot be used.

For this reason, the first algorithm used to find a minimum of functions of the type (II.5) was
the gradient projection method of Rosen [Rosen 60]. This algorithm follows a steepest descent
direction until one or several constraints are violated, and then projects the gradient on the
subspace defined by the active constraints. This method has proven to be very slow with the
functions tested in this report. After some experimentation, the fastest algorithm found has been a
slightly modified version of a variable metric algorithm proposed by Powell [Powell 78]. The
original Powell algorithm occasionally requires the evaluation of the objective function outside
the constraints and has been modified such that the maximum step size at each iteration never leads
to a point outside the constraints. The modified algorithm converges more slowly than the original
Powell algorithm, but for all the cases in this report, the minimum has been found in fewer than 30
iterations, which is a very good rate of convergence given the functions under consideration.

The algorithm has been implemented as an APL package that requires the definition of the objective
function, the gradient of the objective function, the constraints, and the gradients of the
constraints. The following sections describe each function in detail, providing a notational
dictionary consistent with the programs used.
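
The step-limiting idea can be sketched as follows. This is neither Powell's algorithm nor the
authors' APL modification of it; it is a minimal backtracking safeguard, with hypothetical names,
that shrinks a proposed step until every strict inequality constraint holds, so that the objective
is never evaluated at an infeasible point.

```python
def max_feasible_step(x, direction, constraints, shrink=0.5, max_halvings=60):
    """Return the first step in 1, shrink, shrink**2, ... keeping all c(x) > 0.

    Returns 0.0 if none of the tested steps is feasible.
    """
    step = 1.0
    for _ in range(max_halvings):
        trial = [xi + step * di for xi, di in zip(x, direction)]
        if all(c(trial) > 0.0 for c in constraints):
            return step
        step *= shrink
    return 0.0


# Constraints of the form used in this appendix, assuming x = [alpha, beta, gamma].
CONSTRAINTS = [
    lambda x: x[0],         # alpha > 0
    lambda x: x[1],         # beta > 0
    lambda x: x[2],         # gamma > 0
    lambda x: x[0] - x[2],  # alpha - gamma > 0, keeps the hazard positive
]
```

For example, from x = [1.0, 0.5, 0.5] the direction [-2.0, 0.0, 0.0] is cut back to a step of
0.125, the first tested value at which alpha - gamma stays positive. A line search inside an
optimizer would then be restricted to steps no larger than the returned value.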

II.1 File System Failures (Cyclostationary Model)


Function  to  be  minimized 


l(sak,cak)  .  ,  sak[M(tl()-u(tS|)l  ♦  [cak-p(sak.bak„)-s|k[-||--^]w1-la,) 


P\ 

Pi 

-5Z"=1  InCs^mttfj)  +  cdk  -  p(sdk,bmin) 


where 


>3/2._/0.1/2  bmin 


Pdk^Sdk,bmin^ 


Definitions 


K,  *  M(tf  )-M(ts.) 
(a,  +  a9)3/2(ff/2)1/2


Kj  ■  *2, 

S-2T..S 

k4-E".,k. 


_  min 

e 
X1  =  Sdk


X2  *  Cdk 


Objective  Function 


l(x)  *  XlK,  +.  [x2-x1K6-X2[-|l+-|2.]]  K2  +  x?[^  +  -^] 
- [  XlKs_  +  x2- x^g- X2[K3j  +  K^]] 


Gradient 


J2l  =  «  -KK  -  2x  f  +  -^2."l  K  +  2x  T  -3-  +  — ”1 

aXi  K1  K6K2  “iL  /?1  +  p2  -I  *2  +  ^iL  ^  +  fi2  -I 


j-n  ■yv^tyv 

x1K5j  +  xaX1K8'tK3i  +  K4.lx1 


x1K5i  +  X2x1KeIK3i+K4|]X1 


Constraints 


C^x)  »x,>0 
C2(if)  -  x2>0 

C3(x)  »  x^intKg}  +  x2-x1K8-x^[maX{K3)  +  K4  }]>0 


Gradient  of  the  constraints 
min{Ks}  -  2x1  max{Ka  +  K4  }


PARAMETER  ESTIMATION 


II.2 System Failures (Cyclostationary Model)


Function  to  be  minimized 


'tesw-Shw'SwV  *  t  vMtfp-MIts,)! 


+  E"= ,  [chw  -  p(Ssy,kmin)  -  P(s3w.k0)  -  Ssyt^”  7^]  ^i*13 


Pi 

Pi 

•  Z", ,  ln[s3ym(tfj)  t-  Csy  -  P(ssy,kmjp)  -  ^ 


where 

3sy  *  Ssw  +  ®hw 
Definitions 


K,  •  M(tf.)-M(tS,) 
n 

Ka.  - 


*s  *  ^i=i  Ki, 
*2  *  ^l»1  *2, 
*3  *  ^l«l  *3, 
*4  *  ^l«l  *4, 
X2  *  ®hw


x3  3  Chw 


X4  =  k0 


K„  as  defined  in  Section  11.1 . 

O 


(aJtajP(W^ 


e  2(a,+a^ 


Objective  Function 


‘)  =  (x^JK,  +  [x3-(Xl  +  x^Kg-x^-fx^x/f^-T-g.]]^ 


+  (x1+x2)2[^-  +  -^] 

-  Hi , ^  *  x2)K  +  x3  -  (x,  +  x2)K6  -  XlK?  -  (x,  +  x2)2[K3)  +  Kj] 


Gradient 


A  =  Kt  -  K6K2-K7K2-2(x1  +  x2)[  ^  1  K2  +  2(xi  +  X2>E  f;  +  ^ 

‘3l  (x1^2>K51  +  X3-(x1+X2)K6-X1K7-tK31  +  K4i1(X1+X/ 

A  a  Kl  -  K8K2. 2(x,  +  x2)[  A.  +  |2.  ]  K2  +  2(x,  +x2)[ 


_  K-  •  Kfl  ■  2x.{K«  •*■  K^J 

y«  _ 5i  6  1  3i  *i _ 

1-1  (x1+»2)K5.  +  x3  (x1  +x2)Kg-  ■  [K3i  +  X4Kxi  +  x2> 
_2L=K-£n  _ 3 - 

0X3  2  1=1  (x1^2)K5j^x3-(*1+x2)K6-*1K7-tK3j+K4jKx1<.x/ 


g.  „r2  «4_,  V'"  x)K7«2/x4),(x4/(g1^  a#)  _ 

*  as  *XiKJJL  +  '  ]  ^  2 

9x4  x4  a1+tt2  -  (x1+x2)K5ji.X3-(x1+x2)K6-x1K7.lK3_+K4|](x1+x2) 


Constraints 


0,(5?)  *  x,>0 
C2(5?)  =  x2>0 
C3(5?)  =  x3>0 
C4(5?)  =  x4>0 

Cs(5?)  =  (x,  +  x2)min{K5}  +  x2  -  (x,  +  x2)K6  -  x,!^  -  (x,  +  x/tmaxfl^  +  K4_}]  >  0 


Gradient  of  the  constraints 


min{K5  }  -  K8  -  2x,  max{K3  +  K^} 


»  1 


] 

II.3 Cyclostationary model - Direct Method


Function  to  be  minimized 


«« 


P-\  H2 


Tpn  r  tsi>i  "I 

/  _  t  In  |_a  •  e1M  -e  ]*ff2[l-e  JJ 


Definitions 

x,  =.  a  x2  =  o,  x3  =  p,  x4  =  o2  x5  =  p2 
K2  and  K2  as  in  section  11.1 . 


Objective  Function 


la?) .  <x,.x,*4,k2 +^e:„  p -.'Ai  *  *et.,  p 


E:.,m[x,-«2|.-e^l-*4t.-e,!K20] 


Gradient 


-  Ka-E2,,, 


_L 


•XaKa 

tM1-*  ‘i-M1® 


x5K2„ 


J-ia 


x3K2, 


•x,K,  ‘*5^2 . 

x1  •  x2[i  •  e  3  Zi]  •  x4[l  •  e  ij 




*sr.,- 


"x3^o 

x2K2.  6  1 
X3K2. 


"x5K2., 


xrx2[l  e  J  |]  •  X4[1  ■  a  J  "I] 


Constraints 

C,(x)  =  x,>0 

C2(x )  =  x1  -  x2  >  0 
C3(x)  =  x3>0 


Gradient  of  the  constraints 


3c, 

3c, 

3^“  ° 

ac, 

ax3  a0 

0C, 

3c2 

1 

3x2 

ac, 

.  **•  =  0 
a*3 

ac. 

ac, 

"3^"  ° 

ac3 

0X3 

II.4 Simplified Cyclostationary Model

Function  to  be  minimized 


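$$ l(\alpha,\gamma,\beta) = \sum_{i=1}^{n} (\alpha-\gamma)(t_{f_i} - t_{s_i})
   + \sum_{i=1}^{n} \frac{\gamma}{\beta}\left[1 - e^{-\beta(t_{f_i} - t_{s_i})}\right]
   - \sum_{i=1}^{n} \ln\left[\alpha - \gamma\left(1 - e^{-\beta(t_{f_i} - t_{s_i})}\right)\right] $$


Definitions

$x_1 = \alpha$, $x_2 = \gamma$, $x_3 = \beta$
$K_2$ and $K_{2_i}$ as in Section II.1.


Objective Function


$$ l(x) = (x_1 - x_2)K_2 + \frac{x_2}{x_3}\sum_{i=1}^{n}\left[1 - e^{-x_3 K_{2_i}}\right]
   - \sum_{i=1}^{n} \ln\left[x_1 - x_2\left(1 - e^{-x_3 K_{2_i}}\right)\right] $$


Constraints

$C_1(x) = x_1 > 0$

$C_2(x) = x_1 - x_2 > 0$
$C_3(x) = x_3 > 0$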
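
The objective of the simplified model can be written directly from its hazard function,
alpha - gamma * (1 - exp(-beta * t)). The following is a minimal sketch, assuming the operating
intervals between each start and the following failure are supplied as a list `durations` (a
hypothetical name) and reading the parameters as x = (alpha, gamma, beta).

```python
import math


def neg_log_likelihood(x, durations):
    """Negative log-likelihood of the simplified cyclostationary model.

    Raises ValueError when the hazard is not positive, i.e. when the
    objective does not exist because a constraint is violated.
    """
    alpha, gamma, beta = x
    total = 0.0
    for d in durations:
        decay = 1.0 - math.exp(-beta * d)
        hazard = alpha - gamma * decay
        if hazard <= 0.0:
            raise ValueError("hazard must be positive: constraint violated")
        # Integrated hazard over [0, d] minus the log of the hazard at d.
        total += (alpha - gamma) * d + (gamma / beta) * decay - math.log(hazard)
    return total
```

With gamma = 0 the expression reduces to the exponential model, alpha times the total operating
time minus n ln(alpha), which gives a quick sanity check of an implementation.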

II.5 Periodic Failure Rate

Function  to  be  minimized 

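$$ l(s_p,c_p) = \sum_{i=1}^{n} s_p\left[M(t_{f_i}) - M(t_{s_i})\right]
   + \sum_{i=1}^{n} c_p\,(t_{f_i} - t_{s_i})
   - \sum_{i=1}^{n} \ln\left[s_p\,m(t_{f_i}) + c_p\right] $$


Definitions


$K_1$, $K_2$, and $K_{5_i}$ as defined in Section II.1


Objective Function

$$ l(x) = x_1K_1 + x_2K_2 - \sum_{i=1}^{n} \ln\left[x_1K_{5_i} + x_2\right] $$


Gradient


Constraints

$C_1(x) = x_1 \min\{K_{5_i}\} + x_2 > 0$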
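
Since the gradient is not legible in this section, the sketch below obtains it by differentiating
the objective term by term; it is offered as an illustration under stated assumptions, not as the
authors' expression. K1, K2 and the list of K5 values are taken to be the summed increments of M,
the summed operating intervals, and the values of m at the failure times, as suggested by the
function to be minimized.

```python
import math


def periodic_objective(x, k1, k2, k5):
    """l(x) = x1*K1 + x2*K2 - sum_i ln(x1*K5_i + x2)."""
    x1, x2 = x
    return x1 * k1 + x2 * k2 - sum(math.log(x1 * k + x2) for k in k5)


def periodic_gradient(x, k1, k2, k5):
    """Partial derivatives of periodic_objective with respect to x1 and x2."""
    x1, x2 = x
    g1 = k1 - sum(k / (x1 * k + x2) for k in k5)
    g2 = k2 - sum(1.0 / (x1 * k + x2) for k in k5)
    return g1, g2
```

Both functions are only defined where x1 * min(K5) + x2 > 0, which is the constraint listed above;
a caller is expected to enforce it before evaluating either one.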

III. Autocorrelation function estimation


Given an ergodic and stationary process z(t), the problem is to estimate the function

$$ R_{zz}(\tau) = \lim_{T\to\infty}\frac{1}{T}\int_{-T/2}^{T/2} z(t+\tau)\,z(t)\,dt \tag{III.1} $$


For a finite record of N observed values z(n), the autocorrelation function is usually estimated
using the expression

$$ \hat{R}_{zz}(n) = \frac{1}{N}\sum_{i=1}^{N-n} z(i+n)\,z(i) \tag{III.2} $$

This estimate is intuitive except for the factor 1/N. Since N-n terms are summed, it seems that
1/(N-n) would be more exact. In fact (III.2) is a biased estimator of the real autocorrelation
function. However, its expected error is smaller than the expected error that would be obtained
using the (unbiased) estimator with factor 1/(N-n) [Jenkins 68].
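
The two normalizations can be compared side by side. This is a minimal sketch with hypothetical
function names, using zero-based indexing in place of the one-based indexing of the estimator
above.

```python
def autocorr_biased(z, lag):
    """Estimate with factor 1/N: biased, the form preferred in the text."""
    n = len(z)
    return sum(z[i + lag] * z[i] for i in range(n - lag)) / n


def autocorr_unbiased(z, lag):
    """Estimate with factor 1/(N - lag): unbiased, but noisier at large lags."""
    n = len(z)
    return sum(z[i + lag] * z[i] for i in range(n - lag)) / (n - lag)
```

For z = [1, 2, 3, 4] and lag 1 the sum of products is 20, so the biased estimate is 5.0 and the
unbiased estimate is 20/3. The two estimates coincide at lag 0 and diverge as the lag approaches
the record length.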

In  the  cases  presented  in  this  report  the  values  of  z(n)  are  not  directly  observable.  In  the  case  of 
sampling  the  values  of  fraction  of  time  in  Kernel  mode,  what  was  measured  was  the  average  fraction 
of  time  in  Kernel  mode  during  the  last  second,  recording  a  sample  every  five  minutes.  In  the  case  of 
the  number  of  blocks  accessed  to  the  file  system,  the  available  samples  are  the  number  of  blocks 
accessed  during  the  last  five  minutes,  also  with  a  resolution  of  five  minutes.  The  measured  values  are 
not  the  values  of  z(n),  but  the  values  of  the  process 


z’(n) 


z(t)  dt 


(IH.3) 


where $\Delta$ equals five minutes or one second, and the available samples of z'(n) are five
minutes apart.

It has been observed that in the two cases studied in this report, the autocorrelation function
suggests an approximation of the form

$$ R_{zz}(\tau) = \alpha_1 e^{-\beta_1|\tau|} + \alpha_2 e^{-\beta_2|\tau|} \tag{III.4} $$


The problem is then to estimate the values of the $\alpha_i$ from the observed values of z'(n). If

$$ R_{z'z'}(\tau) = \alpha'_1 e^{-\beta_1|\tau|} + \alpha'_2 e^{-\beta_2|\tau|} $$


it  is  easy  to  show  that 


(111.5) 


aie 


■Pi*  0& 


a2e 


(111.6) 


where 


“i 


afcosht/Jj  A)  - 1] 


(111.7) 


The problem is then to estimate the values of the $\alpha'_i$, using (III.2) and the observed
values of z'(n), and to use (III.7) to obtain the values of the $\alpha_i$ of the autocorrelation
function of z(t).

Unfortunately it has not been possible to follow this procedure. The accuracy of the estimated
autocorrelation function is limited basically by two factors: the sampling frequency and the length
of the available record, N. Although many techniques exist for power spectrum estimation that take
into account these two factors [Oppenheim 75] (the power spectrum is the Fourier transform of the
autocorrelation function), no techniques are available for correcting the estimates of the
autocorrelation function itself.

If the sampling frequency is comparable to the bandwidth of the power spectrum, the power
spectrum estimate may be poor due to aliasing. Under these conditions, the estimate of the
autocorrelation function given by (III.2) may take negative values. This is precisely what happens
for the estimated autocorrelation function of the file system utilization process as shown in
Figure 3-2. For a sampling frequency equal to one, the bandwidth of the process would be
approximately 0.59, that is, the sampling frequency is not even twice the process bandwidth.

The solution adopted has been to estimate the $\alpha_i$ and $\beta_i$ directly from a history of
failures as described in Section 4.1, and to estimate the variance of the process,
$\sigma_{zz}^2$. Since

$$ \sigma_{zz}^2 = \alpha_1 + \alpha_2 \tag{III.8} $$


and 
52ji

01 


0H.9) 


And  knowing  at ,  and  <r^,  the  values  of  the  a{  can  be  computed. 